Hello everyone,

TL;DR: With some recent changes in GitHub Actions, and the fact that the ASF now has a lot of donated runners available for all builds, I think we could experiment with disabling "self-hosted" runners for committer builds.

Our self-hosted runners have been extremely helpful (and we should again thank Amazon and Astronomer for donating credits / money for those) at a time when the GitHub public runners were far less powerful and there were fewer of them available for ASF projects. This saved us a LOT of trouble when there was contention between ASF projects. But recently both limitations have been largely removed:

* ASF has 900 public runners donated by GitHub to all projects
* Those public runners now have (as of January) 4 CPUs and 16GB of memory for open-source projects - https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/

While they are not as powerful as our self-hosted runners, the parallelism we utilise brings those builds into not-that-bad shape compared to self-hosted runners. Typical times for the complete set of tests are now ~20m on public runners and ~14m on self-hosted ones.

But this is not the only factor. I think committers experience "Job failed" on self-hosted runners much more often than non-committers do (the stability of our solution is not the best, and we are also using cheaper spot instances). Plus, we limit the total number of self-hosted runners (35), so if several committers submit a few PRs while a canary build is running, the jobs will wait until runners are available. And of course it costs our sponsors credits/money which we could use for other things.

I recently got access to GitHub Actions metrics. While the ASF is keeping an eye on usage and has started limiting the number of parallel jobs that workflows in projects can run, it looks like even if all committer runs are added to the public runners, our usage will still be far below the limits and far lower than that of some other projects (which I will not name here). With access to the metrics I can monitor our usage and react.

I think that if we switch committers to "public" runners by default, the experience will possibly not be much worse for them (and sometimes even better, because of stability and the limited queue).

I have been planning this carefully. I recently made a number of refactors/changes to our workflows that make it way easier to manipulate the configuration and apply various conditions to various jobs, so changing/experimenting with those settings should be - well - a breeze :). A few recent changes have proven that the workflow refactor was definitely worth the effort. I feel like I finally got control over it, where previously it was a bit like herding a pack of cats (which I brought to life myself, but that's another story).

I would like to propose that we run an experiment and see how it works if we switch committer PRs back to the public runners, leaving the self-hosted runners only for canary builds (which makes perfect sense, because those builds run the full set of tests and we need as much speed and power there as we can get). This is pretty safe: we should be able to switch back very easily if we see problems. I will also monitor it and check that our usage stays within the limits of the ASF. I can also add a feature so that committers are able to use self-hosted runners by applying the "use self-hosted runners" label to a PR.

Running it for 2-3 weeks should be enough to gather experience from committers: whether things seem better or worse for them, or maybe they won't really notice a big difference.

Later we could consider some next steps: disabling the self-hosted runners for canary builds if we see that our usage is low and builds are fast enough, and eventually possibly removing the current self-hosted runners and switching to a better k8s-based infrastructure (which we are close to doing, but it is a bit difficult while the current self-hosted solution is so critical to keep running, like rebuilding the plane while it is flying).

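To make the proposed rule concrete, here is a rough sketch of the runner selection I have in mind, written as plain Python rather than as our actual workflow code. All names here (`Build`, `select_runner`, the `PUBLIC` / `SELF_HOSTED` constants) are made up for illustration; only the label text and the three cases come from the proposal above, and I am assuming the label opt-in applies to committer PRs only.

```python
from dataclasses import dataclass, field

# Hypothetical names, for illustration only; the real workflow may differ.
SELF_HOSTED_LABEL = "use self-hosted runners"
PUBLIC = "public"
SELF_HOSTED = "self-hosted"


@dataclass
class Build:
    """Minimal description of a CI run (assumed fields, not a real API)."""
    is_canary: bool = False
    author_is_committer: bool = False
    pr_labels: set = field(default_factory=set)


def select_runner(build: Build) -> str:
    """Pick the runner type for a build under the proposed experiment."""
    # Canary builds run the full set of tests, so they keep self-hosted runners.
    if build.is_canary:
        return SELF_HOSTED
    # Committers can opt back in per PR by applying the label.
    if build.author_is_committer and SELF_HOSTED_LABEL in build.pr_labels:
        return SELF_HOSTED
    # Everything else, including committer PRs by default, uses public runners.
    return PUBLIC
```

With this rule, switching the experiment off again would just mean changing the default returned for committer PRs.
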
I'd love to do it gradually, in "change slowly and observe" mode, especially now that I have access to "proper" metrics.

WDYT?

J.