Thank you for your work on this, Ash!
One thing to mention is that while this only directly affects the committer/PMC runs, it should still free up more resources overall. Might also be worth bringing this up to the ASF board as perhaps other projects can consider similar methods.

On Tue, Feb 9, 2021 at 6:16 AM, Kaxil Naik <[email protected]> wrote:
Great work on this, I know how much time you dedicated to it.
Regards, Kaxil
On Tue, Feb 9, 2021 at 1:40 PM Ash Berlin-Taylor <[email protected]> wrote:
Hi everyone.
After a good two weeks of playing whack-a-mole with bugs, I have finally merged https://github.com/apache/airflow/pull/13730, which means that some builds now run on machines under our control. The biggest differences this will make are that 1) we won't be stuck in a queue behind other ASF projects waiting for our "slot", and 2) builds should also be a bit faster now due to running most of the build on tmpfs.
I will do a more in-depth write-up soon, but the rough architecture is:
- A GitHub application that receives events and, whenever* a check-run is created, posts to:
- An AWS Lambda function (via API Gateway) that checks whether there is already an idle runner
- An ASG that configures r5a.xlarge instances with tmpfs in "interesting" places (docker store, tmp dirs, etc.)
- Some clever processes on the instance that set/clear ScaleInProtection so that running jobs don't get killed, and emit a custom CloudWatch metric
- A CloudWatch alarm to scale down the ASG when nodes are idle
- A paid-for Docker Hub user on these machines to avoid hitting pull limits
The major downside is that, due to security concerns, builds for non-committers/PMC members still run on the public queue. However, the "build image" step for everyone now runs on our machines, so everyone should benefit a bit.
I do expect a bit of fallout from this, so I will be monitoring the Actions queue, but if there are any problems or issues let me know (here, or on Slack).
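To make the ScaleInProtection step in the architecture above a little more concrete, here is a minimal sketch of how a process on a runner might set and clear protection around a job and emit a metric. It is not the code from the PR. The function name `scale_in_protected`, the `ci-runners` namespace and the `jobs-running` metric name are all made up for illustration, and the two client arguments are assumed to behave like `boto3.client("autoscaling")` and `boto3.client("cloudwatch")`.

```python
from contextlib import contextmanager


@contextmanager
def scale_in_protected(autoscaling, cloudwatch, asg_name, instance_id,
                       namespace="ci-runners", metric="jobs-running"):
    """Protect this instance from ASG scale-in while a job is running.

    `namespace` and `metric` are hypothetical names chosen for this sketch.
    """

    def set_protection(protected):
        # Toggle scale-in protection for this one instance in the ASG.
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=protected,
        )

    def emit(value):
        # Publish a custom metric that an idle-node alarm could watch.
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[{
                "MetricName": metric,
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name},
                ],
                "Value": value,
            }],
        )

    set_protection(True)
    emit(1)
    try:
        yield
    finally:
        # Clear protection even if the job fails, so that an idle node
        # can still be scaled in afterwards.
        set_protection(False)
        emit(0)
```

A runner wrapper would then run each job inside `with scale_in_protected(asg_client, cw_client, asg_name, instance_id): ...`. Passing the clients in as arguments keeps the sketch testable without AWS credentials.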
-ash
