Hi everyone.

After a good two weeks of playing whack-a-mole with bugs, I have finally merged <https://github.com/apache/airflow/pull/13730>, which means that /some/ builds now run on machines under our control.

The two biggest differences this will make are: 1) we won't be stuck in a queue behind other ASF projects waiting for our "slot", and 2) builds should be a bit faster, since most of the build now runs on tmpfs.

I will do a more in-depth write-up soon, but the rough architecture is:

- A GitHub application that receives events and, whenever* a check-run is created, posts to:
- An AWS Lambda function (via API Gateway) that checks if there is an idle runner already
- An ASG that configures r5a.xlarge instances with tmpfs in "interesting" places (docker store, tmp dirs etc.)
- Some clever processes on the instance that set/clear ScaleInProtection so that running jobs don't get killed, and emit a custom CloudWatch metric
- A CloudWatch alarm to scale down the ASG when nodes are idle
- A paid-for docker hub user on these machines to avoid hitting pull limits.
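
As a rough illustration of two pieces of logic from the list above (this is a sketch, not the code from the PR): the Lambda-side "is there an idle runner already?" check, and the instance-side step that toggles scale-in protection and reports a metric. The `RunnerState` type, the function names, the metric namespace and the metric name are all hypothetical. The `autoscaling` and `cloudwatch` arguments are assumed to be boto3 clients (`boto3.client("autoscaling")` and `boto3.client("cloudwatch")`), and the two calls used, `set_instance_protection` and `put_metric_data`, are real boto3 methods.

```python
from dataclasses import dataclass


@dataclass
class RunnerState:
    """Hypothetical snapshot of one self-hosted runner instance."""
    instance_id: str
    jobs_running: int


def needs_more_capacity(runners):
    """Lambda-side check: True when no existing runner is idle.

    What happens next (for example, raising the ASG's desired capacity)
    is an assumption and is left out of this sketch.
    """
    return not any(r.jobs_running == 0 for r in runners)


def sync_instance(state, asg_name, autoscaling, cloudwatch):
    """Instance-side step: protect a busy runner and report its job count.

    Returns True if the instance is now protected from scale-in.
    """
    busy = state.jobs_running > 0

    # Busy instances are protected so the ASG cannot terminate them
    # mid-job; idle instances are released so the alarm can scale down.
    autoscaling.set_instance_protection(
        InstanceIds=[state.instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=busy,
    )

    # Custom metric that a CloudWatch alarm could watch for idleness.
    cloudwatch.put_metric_data(
        Namespace="ci/runners",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "jobs-running",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name}
                ],
                "Value": state.jobs_running,
                "Unit": "Count",
            }
        ],
    )
    return busy
```

Passing the clients in as arguments keeps the logic testable without AWS credentials; on a real instance something like this would run in a loop or be triggered by the runner's job start/finish hooks.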

The major downside is that, due to security concerns, builds from people who are not committers or PMC members still run on the public queue. However, the "build image" step for everyone now runs on our machines, so everyone should benefit a bit.

I do expect a bit of fallout from this, so I will be monitoring the Actions queue. If there are any problems or issues, let me know (here or on Slack).

-ash
