Yeah, valid concerns, Hussein.

And I am happy to share some more information on that. I did not want to
put all of it in the original email, but I see it might be interesting
for you and possibly others.

I am closely following the numbers now. One of the reasons I am doing /
proposing it now is that finally (after almost 3 years of waiting) we
have access to some metrics that we can check. As of last week I have
access to the ASF metrics (
https://issues.apache.org/jira/browse/INFRA-25662).

I have access to "organisation"-level information. Infra does not want to
open it to everyone - not even to every member - but since I have been very
active and helping with a number of things, I got the access granted as an
exception. I also saw a small dashboard that INFRA is preparing to open to
everyone once they sort out the access, where we will be able to see the
"per-project" usage.

Some stats I can share (they asked me not to share too much):

From what I looked at, I can tell that right now we (the whole ASF
organisation) are safely below the total capacity, with a large margin -
enough to handle spikes. But of course usage keeps growing, and if it stays
uncontrolled we could again reach the situation that triggered getting
self-hosted runners a few years ago.

Luckily, INFRA is getting it under control this time (and metrics will
help). In the last INFRA newsletter, they announced some limitations that
will apply to the projects (effective as of the end of April) - so once
those are followed, we should be "safe" from being impacted by others (i.e.
the noisy-neighbour effect). Some of the projects (not Airflow (!)) have
been exceeding those so far and will be capped - they will need to optimize
their builds eventually.

Those are the rules:

* All workflows MUST have a job concurrency level less than or equal to 20.
This means a workflow cannot have more than 20 jobs running at the same
time across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to
15. Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200
hours).
* The average number of minutes a project uses in any consecutive five-day
period MUST NOT exceed the equivalent of 30 full-time runners (216,000
minutes, or 3,600 hours).
* Projects whose builds consistently cross the maximum use limits will lose
their access to GitHub Actions until they fix their build configurations.
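
To put the "full-time runner" caps above into concrete numbers, here is a
tiny Python sketch - purely illustrative, just my own conversion (note that
25 runners over a full week works out to ~252,000 minutes, so the 250,000
figure above looks like a rounded value):

    # Convert "full-time runner" caps into total build minutes / hours.
    # Assumption: a "full-time runner" means one runner busy 24 hours a day.
    MINUTES_PER_DAY = 24 * 60

    def runner_minutes(runners: int, days: int) -> int:
        """Total build minutes used by `runners` full-time runners over `days` days."""
        return runners * days * MINUTES_PER_DAY

    weekly_cap = runner_minutes(25, 7)    # ~252,000 minutes (~4,200 hours)
    five_day_cap = runner_minutes(30, 5)  # 216,000 minutes (3,600 hours)

    print(f"weekly cap (25 FT runners): {weekly_cap:,} min ({weekly_cap // 60:,} h)")
    print(f"5-day cap (30 FT runners):  {five_day_cap:,} min ({five_day_cap // 60:,} h)")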

Those numbers on their own do not tell much, but we can easily see what
they mean when we put them side by side with "our" current numbers.

* Currently - with all the "public" usage - we are at 8 full-time runners.
This is after some of the changes I've already made: with the recent changes
I moved a lot of the non-essential build components that do not require much
parallelism to public runners.
* The 20/15 jobs limit is a bit artificial (not really enforceable at the
workflow level) - but in our case, since I optimized most PRs to run just a
subset of the tests, the average will be way below that. No matter whether
you are a committer or not, regular PRs run a far smaller subset of the jobs
than a full "canary" build. And for canary builds we should stay - at least
for now - with self-hosted runners.

Some back-of-the-envelope calculations of what might happen when we switch
to "public" for everyone:

Unfortunately, until we enable the experiment, I do not have an easy way to
distinguish the "canary" runs from the "committer" runs, so these are partly
guesses. But our self-hosted build time is ~25% more than our public build
time (100,000 minutes vs. 80,000 minutes this month) - see the attached
screenshot for the current month.
As you can see, image builds have already been moved to public runners for
everyone as of roughly two weeks ago, so that part will not change.

Taking into account that the self-hosted runners are ~1.7x faster, this
means that the work currently done on self-hosted runners is equivalent to
~2x our current public runner usage. We can assume that 50% of that is
committer PRs and "canary" builds are the other half (this sounds safe
because canary builds use way more resources, even if committers run many
more PRs than we have merges).
So by moving committer builds to public runners, we will - likely - increase
our public time ~2x (from 8 FT runners to 16 FT runners) - way below the 25
FT runners that is the "cap" from INFRA. Even if we move all canary builds
there, we should be at most at ~24 FT runners, which is still below the
limits but would be dangerously close to them. That's why I want to keep
canary builds on self-hosted runners until we get some clarity on the impact
of moving the PR builds.
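
For anyone who wants to check this arithmetic, here is the same projection
written out as a small Python sketch - purely illustrative; the 1.7x
speedup, the 100,000/80,000 minutes and the 50/50 committer-vs-canary split
are the rough figures and guesses from above, not measured values:

    # Back-of-the-envelope projection of public runner usage after the switch.
    public_minutes = 80_000        # public runner minutes used this month (approx.)
    self_hosted_minutes = 100_000  # self-hosted runner minutes used this month (approx.)
    speedup = 1.7                  # assumption: self-hosted runners are ~1.7x faster

    # The work currently done on self-hosted runners would take ~1.7x longer
    # on public ones: 100,000 * 1.7 = 170,000 minutes, i.e. roughly 2x the
    # current public usage.
    extra_load = round(self_hosted_minutes * speedup / public_minutes)  # ~2

    committer_share = 0.5   # guess: committer PRs are half of the self-hosted time
    current_public_ft = 8   # current public usage, expressed in "full-time runners"
    infra_cap_ft = 25       # INFRA's weekly cap, in full-time runners

    after_committer_prs = current_public_ft * (1 + committer_share * extra_load)  # ~16 FT
    after_everything = current_public_ft * (1 + extra_load)                       # ~24 FT

    print(f"after moving committer PRs only: ~{after_committer_prs:.0f} FT runners")
    print(f"after moving canary builds too:  ~{after_everything:.0f} FT runners (cap: {infra_cap_ft})")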

We will see the final numbers when we move, but I think we are pretty safe
within the limits.

J.


On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:

> Although 900 runners seems like a lot, they are shared among the Apache
> organization's 2.2k repositories. Of course, only a few of them are active
> (let's say 50), and some of them use an external CI tool for big jobs (e.g.
> Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other very
> active repositories based entirely on GHA, for example Iceberg, Spark,
> Superset, ...
>
> I haven't found the ASF runners metrics dashboard to check the max
> concurrency and the max queue time during peak hours, but I'm sure that
> moving Airflow committers' CI jobs to public runners will put some pressure
> on these runners, especially since these committers are the most active
> contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64
> GB RAM) are used almost all the time, so we can say that we will need
> around 70 ASF runners to run the same jobs.
>
> There is no harm in testing and deciding after 2-3 weeks.
>
> We also need to find a way to let the infra team help us solve the
> connectivity problem with the ARC runners
> <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>
> +1 for testing what you propose.
>
> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com>
> wrote:
>
> > +1 I like the idea.
> > Looking forward to seeing the difference.
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis
> > <ferru...@amazon.com.invalid>
> > wrote:
> >
> > > Interested in seeing the difference, +1
> > >
> > >
> > >  - ferruzzi
> > >
> > >
> > > ________________________________
> > > From: Oliveira, Niko <oniko...@amazon.com.INVALID>
> > > Sent: Thursday, April 4, 2024 2:00 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > > self-hosted runners for commiter PRs
> > >
> > >
> > > +1 I'd love to see this as well.
> > >
> > > In the past, stability and long queue times of PR builds have been very
> > > frustrating. I'm not 100% sure this is due to using self-hosted runners,
> > > since a queue depth of 35 (to my mind) should be plenty. But something
> > > about that setup has never seemed quite right to me with queuing.
> > > Switching to public runners for a while to experiment would be great to
> > > see if it improves.
> > >
> > > ________________________________
> > > From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
> > > Sent: Thursday, April 4, 2024 12:41:02 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > > self-hosted runners for commiter PRs
> > >
> > >
> > > +1 from me to this idea.
> > >
> > > Sounds very reasonable to me.
> > > At times, my experience has been better with public runners instead of
> > > self-hosted runners :)
> > >
> > > And as already mentioned in the discussion, I think having the ability
> > > to apply the label "use-self-hosted-runners" at critical times would be
> > > nice to have too.
> > >
> > >
> > > On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > TL;DR: With some recent changes in GitHub Actions and the fact that
> > > > ASF has a lot of donated runners available for all the builds, I think
> > > > we could experiment with disabling "self-hosted" runners for committer
> > > > builds.
> > > >
> > > > These self-hosted runners of ours have been extremely helpful (and we
> > > > should again thank Amazon and Astronomer for donating credits / money
> > > > for those) - back when the GitHub public runners were far less powerful
> > > > and we had fewer of them available for ASF projects. This saved us a
> > > > LOT of trouble where there was contention between ASF projects.
> > > >
> > > > But as of recently both limitations have been largely removed:
> > > >
> > > > * ASF has 900 public runners donated by GitHub to all projects
> > > > * Those public runners now (as of January) have, for open-source
> > > > projects, 4 CPUs and 16GB of memory -
> > > >
> > > > https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
> > > >
> > > >
> > > > While they are not as powerful as our self-hosted runners, the
> > > > parallelism we utilise brings those builds into not-that-bad shape
> > > > compared to the self-hosted runners. The typical difference between
> > > > public and self-hosted runners for the complete set of tests is now
> > > > ~20m for public runners and ~14m for self-hosted ones.
> > > >
> > > > But this is not the only factor - I think committers experience the
> > > > "Job failed" error on self-hosted runners generally much more often
> > > > than non-committers (the stability of our solution is not the best,
> > > > and we are also using cheaper spot instances). Plus - we limit the
> > > > total number of self-hosted runners (35) - so if several committers
> > > > submit a few PRs and we have a canary build running, the jobs will
> > > > wait until runners are available.
> > > >
> > > > And of course it costs the sponsors' credits/money, which we could
> > > > use for other things.
> > > >
> > > > I have - as of recently - access to GitHub Actions metrics - and while
> > > > ASF is keeping an eye on things and has started limiting the number of
> > > > parallel jobs that projects' workflows run, it looks like even if all
> > > > committer runs are added to the public runners, we will still cause far
> > > > lower usage than the limits allow, and far lower than some other
> > > > projects (which I will not name here). I have access to the metrics so
> > > > I can monitor our usage and react.
> > > >
> > > > I think that possibly - if we switch committers to "public" runners by
> > > > default - the experience will not be much worse for them (and sometimes
> > > > even better - because of stability / a limited queue).
> > > >
> > > > I was planning this carefully - I made a number of refactors/changes to
> > > > our workflows recently that make it way easier to manipulate the
> > > > configuration and apply various conditions to various jobs - so
> > > > changing/experimenting with those settings should be - well - a breeze
> > > > :). A few recent changes have proven that this change and workflow
> > > > refactor were definitely worth the effort. I feel like I finally got
> > > > control over it, where previously it was a bit like herding a pack of
> > > > cats (which I brought to life myself, but that's another story).
> > > >
> > > > I would like to propose running an experiment to see how it works if we
> > > > switch committer PRs back to the public runners - leaving the
> > > > self-hosted runners only for canary builds (which makes perfect sense
> > > > because those builds run the full set of tests and we need as much
> > > > speed and power there as we can get).
> > > >
> > > > This is pretty safe - we should be able to switch back very easily if
> > > > we see problems. I will also monitor it and see whether our usage is
> > > > within the ASF limits. I can also add a feature so that committers can
> > > > use self-hosted runners by applying a "use self-hosted runners" label
> > > > to a PR.
> > > >
> > > > Running it for 2-3 weeks should be enough to gather experience from
> > > > committers - whether things will seem better or worse for them - or
> > > > maybe they won't really notice a big difference.
> > > >
> > > > Later we could consider some next steps - disabling the self-hosted
> > > > runners for canary builds if we see that our usage is low and builds
> > > > are fast enough, and eventually possibly removing the current
> > > > self-hosted runners and switching to a better k8s-based infrastructure
> > > > (which we are close to doing, but it is a bit difficult while the
> > > > current self-hosted solution is so critical to keep running - like
> > > > rebuilding the plane while it is flying). I'd love to do it gradually,
> > > > in "change slowly and observe" mode - especially now that I have
> > > > access to "proper" metrics.
> > > >
> > > > WDYT?
> > > >
> > > > J.
> > > >
> > >
> >
>