+1 from me to this idea. Sounds very reasonable to me. At times, my experience has been better with public runners than with self-hosted runners :)
And as already mentioned in the discussion, I think having the ability to apply the "use-self-hosted-runners" label at critical times would be nice to have too.

On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <[email protected]> wrote:

> Hello everyone,
>
> TL;DR With some recent changes in GitHub Actions and the fact that ASF has
> a lot of runners available donated for all the builds, I think we could
> experiment with disabling "self-hosted" runners for committer builds.
>
> The self-hosted runners of ours have been extremely helpful (and we should
> again thank Amazon and Astronomer for donating credits / money for those) -
> when the GitHub public runners were far less powerful - and we had
> fewer of those available for ASF projects. This saved us a LOT of
> trouble when there was contention between ASF projects.
>
> But as of recently both limitations have been largely removed:
>
> * ASF has 900 public runners donated by GitHub to all projects
> * Those public runners now have (as of January) 4 CPUs and 16GB of
>   memory for open-source projects -
>   https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>
> While they are not as powerful as our self-hosted runners, the parallelism
> we utilise for those brings those builds into not-that-bad shape compared to
> self-hosted runners. Typical times for the complete set of tests are now
> ~20m for public runners and ~14m for self-hosted ones.
>
> But this is not the only factor - I think committers experience the "Job
> failed" for self-hosted runners generally much more often than
> non-committers (stability of our solution is not the best, also we are using
> cheaper spot instances). Plus - we limit the total number of self-hosted
> runners (35) - so if several committers submit a few PRs and we have a canary
> build running, the jobs will wait until runners are available.
>
> And of course it costs the credits/money of sponsors which we could use for
> other things.
>
> I have - as of recently - access to GitHub Actions metrics - and while ASF
> is keeping an eye and has started limiting the number of parallel jobs that
> workflows in projects run, it looks like even if all committer runs are added
> to the public runners, we will still cause far lower usage than the limits
> allow and far lower than some other projects (which I will not name here). I
> have access to the metrics so I can monitor our usage and react.
>
> I think possibly - if we switch committers to "public" runners by default
> - the experience will not be much worse for them (and sometimes even better
> - because of stability/limited queue).
>
> I was planning this carefully - I made a number of refactors/changes to our
> workflows recently that make it way easier to manipulate the configuration
> and get various conditions applied to various jobs - so
> changing/experimenting with those settings should be - well - a breeze :).
> A few recent changes have proven that this change and workflow refactor were
> definitely worth the effort. I feel like I finally got control over it
> where previously it was a bit like herding a pack of cats (which I
> brought to life by myself, but that's another story).
>
> I would like to propose to run an experiment and see how it works if we
> switch committer PRs back to the public runners - leaving the self-hosted
> runners only for canary builds (which makes perfect sense because those
> builds run a full set of tests and we need as much speed and power there as
> we can).
>
> This is pretty safe. We should be able to switch back very easily if we see
> problems. I will also monitor it and see if our usage is within the limits
> of the ASF. I can also add the feature that committers should be able to
> use self-hosted runners by applying the "use self-hosted runners" label to
> a PR.
>
> Running it for 2-3 weeks should be enough to gather experience from
> committers - whether things will seem better or worse for them - or maybe
> they won't really notice a big difference.
>
> Later we could consider some next steps - disabling the self-hosted runners
> for canary builds if we see that our usage is low and builds are fast
> enough, eventually possibly removing current self-hosted runners and
> switching to a better k8s based infrastructure (which we are close to doing,
> but it is a bit difficult while the current self-hosted solution is so
> critical to keep running - like rebuilding the plane while it is flying).
> I'd love to do it gradually in the "change slowly and observe" mode -
> especially now that I have access to "proper" metrics.
>
> WDYT?
>
> J.
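
For what it's worth, here is how I read the proposed selection rule, as a rough Python sketch. It is only an illustration, not the actual workflow code: `Build`, `select_runner` and the field names are hypothetical, and I am assuming the opt-in label only takes effect on committer PRs.

```python
from dataclasses import dataclass, field

# Label text as quoted in the proposal; the final label name may differ.
SELF_HOSTED_LABEL = "use self-hosted runners"


@dataclass
class Build:
    """Hypothetical description of a CI run, for illustration only."""
    is_canary: bool = False
    is_committer: bool = False
    labels: set = field(default_factory=set)


def select_runner(build: Build) -> str:
    """Return "self-hosted" or "public" for a build under the proposed experiment."""
    # Canary builds run the full test set, so they keep the faster runners.
    if build.is_canary:
        return "self-hosted"
    # Assumption: a committer can opt back in by labelling the PR.
    if build.is_committer and SELF_HOSTED_LABEL in build.labels:
        return "self-hosted"
    # Everything else, including committer PRs by default, uses public runners.
    return "public"
```

With a rule like this, switching back is a one-line change: the default branch returns "self-hosted" for committers again.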
