potiuk commented on PR #24664: URL: https://github.com/apache/airflow/pull/24664#issuecomment-1168093297
> @potiuk, just want to double check there is no security/resource considerations here with us spinning up ARM machines for normal, non-committer PRs, right? Good questions. I :heart: that you asked them :). Here are my explanations and way how it is protected, but I would love a challenge/review and confirming that my thinking is right. Security is a teamwork and I don't trust myself, so I love someone can verify my apprach. # Security The ARM steps are executed in 2 caset: a) build image workfow. Those are not "directly" affected by PR changes - the command will only run the "Dockerfile RUN commands" inside the ARM docker engine - Dockerfile is the only user-provided code that is executed during that build - and it is only executed as part of "docker build" (which runs inside the docker builder). You'd have to escape the container to get into the ARM instance. This is basically the same (or lower) risk we have currently - the Dockerfile RUN commands are executed inside the docker-engine of the S3 AMD instance we run. It's actually a bit lower because the ARM instances do not run the "runner" and the runner - it's a "bare" ARM instance with docker engine running with SSH-port-forwarded connection to it. b) if you run PR from a branch in "apache" repo (like I did in this PR) - then the ARM instance will run in the CI worklfow. This is controlled by `if: needs.build-info.outputs.inWorkflowBuild == 'true'`. You need to have committer status to be able to push branch to Airflow Repo and make PR from it. Non-commiters can only run this CI workflow from fork, and even if they can change the ci.yaml there and change the `if: needs.build-info.outputs.inWorkflowBuild == 'true'` it will not work because fork-run ci.yaml workflow does not run on our machines and only our build machines have capability to spin the ARM instances. The"build_image" workflow does not run for those PR-s because they are skipped if pull request head is "apache/airlfow" (similarly as all the other "build" steps) https://github.com/apache/airflow/pull/24664/files#diff-a6ce9b5b0a53d8dc6dbd59d40b23994c3e68a91fac73c0d7eacdf373c95476c9R386. The reason for having those duplicate "build" steps in CI workflow (in case of PRs from apache/airflow repo) is that I can actually test those ARM builds. When they are run from "CI" workflow, they run with "my code", when they are run in "build" workflow - they are run with "main" code and I only can test them when I merge them (which is main reason some of the workflows need a few merges to finally succed - as I have literally no way to test them before merge. For example the "build" workflow Start ARM instance/Stop ARM instance had never been tested, because the only way I can test it is by merging the code to main. # Cost Those ARM machines have auto-cancelling - no matter if succeded or failed: 50 minues [self-kill-switch](https://github.com/apache/airflow/blob/main/scripts/ci/images/self_terminate.sh) and "cancel" is run always: https://github.com/apache/airflow/pull/24664/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fR417 - so even if someone presses cancel too many times or in case build "hangs" we got maximum ~ 1hr of instance run. The instances have [maximum bid spot price set]( https://github.com/apache/airflow/blob/48ceda22bdbee50b2d6ca24767164ce485f3c319/scripts/ci/images/ci_start_arm_instance_and_connect_to_docker.sh#L27) for requesting the instance. It is set to 0.1 USD /HR. Curent spot price in Ohio: `c6g.xlarge | $0.0399 per Hour` - so currently 1 h of running the instance is 0.04 USD - max 0.1 USD. Building the 4x images (when cache is refreshed - which is vast majority of times) is ~ 12 minutes. This means that running a single ARM instance step is 0.04 / 5 ~ 0.008 USD (max 0,02 USD). We have ~ 100 image builds / day on average working day = 0.8 to 2 USD / day = 30 USD Month (with rather generous over-counting the days). this is ~ 30/1300 ~ 2% of current spending. I personally think it's worth it as it saves a lot of our contributors from headaches on M1s - and this is not academic - this has already happened https://github.com/apache/airflow/pull/24635 and it was couple of hours lost for a few people. We have currently no protection mechanisms to prevent it from happening again - and taking into account that a number of deps (including mysql and mssql) do not yet have debian ARM package and fail when building, it is likely to happen again BTW. Probably we loose way more but not merging some optimisations which are currently in the pipeline (for example this one https://github.com/apache/airflow/pull/24666 - I have not assesed that one because it is far more difficult but I would guess it allows to save ~ 3 minutes of build time x2 for AMD instances at 0.08 USD /hr per version in ~ 80% (wild guess) of our PRs which is already more than that cost of those ARM instances. And I already saved quite a lot and more : https://github.com/apache/airflow/pull/24580 (and more to come with more breeze conversion) so I am on a cost-and-build-saving spree here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
