[DISCUSS] Consider disabling self-hosted runners for committer PRs

Jarek Potiuk Thu, 04 Apr 2024 12:20:12 -0700

Hello everyone,

TL;DR: With some recent changes in GitHub Actions, and the fact that the ASF
has a lot of donated runners available for all builds, I think we could
experiment with disabling "self-hosted" runners for committer builds.


Our self-hosted runners have been extremely helpful (and we should again
thank Amazon and Astronomer for donating credits / money for those) at a
time when the GitHub public runners were far less powerful and fewer of
them were available to ASF projects. This saved us a LOT of trouble when
there was contention between ASF projects.

But as of recently both limitations have been largely removed:

* ASF has 900 public runners donated by GitHub to all projects
* Those public runners now have (as of January, for open-source projects)
4 CPUs and 16GB of memory -
https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/


While they are not as powerful as our self-hosted runners, the parallelism
we use brings those builds into not-that-bad shape compared to the
self-hosted runners. The typical time for the complete set of tests is now
~20m on public runners and ~14m on self-hosted ones.

But this is not the only factor - I think committers see "Job failed" on
self-hosted runners much more often than non-committers do (the stability
of our solution is not the best, and we are also using cheaper spot
instances). Plus, we limit the total number of self-hosted runners (35), so
if several committers submit a few PRs while a canary build is running, the
jobs will wait until runners are available.

And of course it costs the credits/money of sponsors which we could use for
other things.

I have recently gained access to GitHub Actions metrics. While the ASF is
keeping an eye on usage and has started limiting the number of parallel
jobs that workflows in projects can run, it looks like even if all
committer runs are added to the public runners, our usage will still be far
below the limits, and far lower than that of some other projects (which I
will not name here). Since I have access to the metrics, I can monitor our
usage and react.

I think that if we switch committers to "public" runners by default, the
experience will not be much worse for them (and sometimes even better,
because of stability and less queueing).

I have been planning this carefully - I recently made a number of
refactors/changes to our workflows that make it way easier to manipulate
the configuration and apply various conditions to various jobs - so
changing/experimenting with those settings should be - well - a breeze :).
A few recent changes have proven that the workflow refactor was definitely
worth the effort. I feel like I finally have control over it, where
previously it was a bit like herding a pack of cats (which I brought to
life myself, but that's another story).

I would like to propose that we run an experiment and see how it works if
we switch committer PRs back to the public runners, leaving the self-hosted
runners only for canary builds (which makes perfect sense because those
builds run a full set of tests and we need as much speed and power there as
we can get).

This is pretty safe. We should be able to switch back very easily if we see
problems. I will also monitor it and check that our usage stays within the
ASF limits. I can also add a feature that lets committers use self-hosted
runners by applying the "use self-hosted runners" label to a PR.
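
To make the proposed rules concrete, here is a rough sketch of the
selection logic. It is only an illustration, not our actual workflow code:
the BuildContext type, the select_runner function and the runner label
values are all hypothetical names made up for this example. Only the "use
self-hosted runners" PR label comes from the proposal itself.

```python
from dataclasses import dataclass

# Hypothetical runner label sets, for illustration only.
PUBLIC_RUNNER = ("ubuntu-22.04",)
SELF_HOSTED_RUNNER = ("self-hosted", "Linux", "X64")

# PR label named in the proposal that lets a committer opt back in.
OVERRIDE_LABEL = "use self-hosted runners"


@dataclass(frozen=True)
class BuildContext:
    """Hypothetical description of a single workflow run."""

    is_canary: bool
    is_committer: bool
    pr_labels: frozenset = frozenset()


def select_runner(ctx: BuildContext) -> tuple:
    """Pick runner labels following the rules proposed in this thread."""
    if ctx.is_canary:
        # Canary builds run the full set of tests, so they keep the
        # faster self-hosted runners.
        return SELF_HOSTED_RUNNER
    if ctx.is_committer and OVERRIDE_LABEL in ctx.pr_labels:
        # A committer explicitly asked for self-hosted runners on this PR.
        return SELF_HOSTED_RUNNER
    # Everything else, committer PRs included, uses the public runners.
    return PUBLIC_RUNNER
```

In this sketch the label is honoured only for committers, so a
non-committer PR carrying the same label would still run on public runners.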

Running it for 2-3 weeks should be enough to gather experience from
committers - whether things will seem better or worse for them - or maybe
they won't really notice a big difference.

Later we could consider some next steps: disabling the self-hosted runners
for canary builds if we see that our usage is low and builds are fast
enough, and eventually possibly removing the current self-hosted runners
and switching to a better k8s-based infrastructure (which we are close to
doing, but it is a bit difficult while the current self-hosted solution is
so critical to keep running - like rebuilding the plane while it is flying).
I'd love to do it gradually in the "change slowly and observe" mode -
especially now that I have access to "proper" metrics.

WDYT?

J.
