Hi everyone.
After a good two weeks of playing whack-a-mole with bugs, I have
finally merged <https://github.com/apache/airflow/pull/13730> which
means that /some/ builds now run on machines under our control.
The biggest differences this will make are that 1) we won't be stuck in a
queue behind other ASF projects waiting for our "slot", and 2) builds
should be a bit faster due to running most of the build on tmpfs.
I will do a more in-depth write-up soon, but the rough architecture is:
- A GitHub application receives events and, whenever* a check-run is
created, posts to:
- An AWS Lambda function (via API Gateway) that checks whether there is
an idle runner already
- An ASG that configures r5a.xlarge instances with tmpfs in
"interesting" places (docker store, tmp dirs etc)
- Some clever processes on the instance that set/clear
ScaleInProtection so that running jobs don't get killed, and emit a
custom CloudWatch metric
- A CloudWatch alarm to scale down the ASG when nodes are idle
- A paid-for docker hub user on these machines to avoid hitting pull
limits.
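
To make the ScaleInProtection item above a bit more concrete, here is a
minimal sketch in Python. It is not the code running on the instances,
just an illustration of the idea: protect the instance while a job
runs, publish a metric, and clear both afterwards. The function name,
the metric namespace and the metric name are made up for this example.
The two client calls (set_instance_protection and put_metric_data) are
the standard boto3 ones, and the clients are passed in as arguments so
the sketch itself has no AWS dependency.

```python
from contextlib import contextmanager


@contextmanager
def protected_job(autoscaling, cloudwatch, asg_name, instance_id,
                  namespace="github.actions", metric="jobs-running"):
    """Hold scale-in protection on this instance while a job runs.

    `namespace` and `metric` are hypothetical names for illustration.
    """

    def set_protection(protected):
        # boto3 Auto Scaling client: set_instance_protection
        autoscaling.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=[instance_id],
            ProtectedFromScaleIn=protected,
        )

    def emit(value):
        # boto3 CloudWatch client: put_metric_data
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[{
                "MetricName": metric,
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name},
                ],
                "Value": value,
                "Unit": "Count",
            }],
        )

    # Protect first, so the ASG cannot pick this node for scale-in
    # between the job starting and the metric being published.
    set_protection(True)
    emit(1)
    try:
        yield
    finally:
        # Always report idle and drop protection, even if the job failed.
        emit(0)
        set_protection(False)
```

On a real instance this would be used roughly as
`with protected_job(boto3.client("autoscaling"), boto3.client("cloudwatch"), asg_name, instance_id): run_the_job()`,
where `run_the_job` stands in for whatever starts the runner process. A
CloudWatch alarm on the metric can then drive the scale-down once no
jobs are running.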
The major downside is that, due to security concerns, builds for people
who are not committers or PMC members still run on the public queue.
However, the "build image" step for everyone now runs on our machines,
so everyone should benefit a bit.
I do expect a bit of fallout from this, so I will be monitoring the
Actions queue. If there are any problems or issues, let me know
(here, or on Slack).
-ash