Seeing no big "no's" - I will prepare and run the experiment, starting some time next week after we get 2.9.0 out - I do not want to break anything there. In the meantime, a preparatory PR to add the "use self-hosted runners" label is out: https://github.com/apache/airflow/pull/38779
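For context, label-driven runner selection in GitHub Actions generally looks like the sketch below. This is a hypothetical fragment, not the content of the linked PR - the job names and the fallback runner label are assumptions; only the "use self-hosted runners" label name comes from the thread.

```yaml
# Hypothetical sketch, not the actual apache/airflow workflow: a cheap
# selector job inspects the PR labels and exposes the runner pool as an
# output for downstream jobs to consume.
jobs:
  select-runner:
    runs-on: ubuntu-22.04
    outputs:
      runs-on: ${{ steps.decide.outputs.runs-on }}
    steps:
      - id: decide
        run: |
          if [[ "${{ contains(github.event.pull_request.labels.*.name, 'use self-hosted runners') }}" == "true" ]]; then
            echo "runs-on=self-hosted" >> "$GITHUB_OUTPUT"
          else
            echo "runs-on=ubuntu-22.04" >> "$GITHUB_OUTPUT"
          fi
  tests:
    needs: select-runner
    # The expression is resolved before the job is scheduled, so the whole
    # test job lands on whichever pool the selector chose.
    runs-on: ${{ needs.select-runner.outputs.runs-on }}
    steps:
      - run: echo "Running on the selected runner pool"
```

Keeping the decision in a tiny always-public job means the label is honored even when the self-hosted pool is saturated.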
On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar <rbish...@amazon.com.invalid> wrote:

> +1 with trying this out. I agree with keeping the canary builds self-hosted in order to validate the usage for the PRs.
>
> -- Rajesh
>
>
> From: Jarek Potiuk <ja...@potiuk.com>
> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> Date: Friday, April 5, 2024 at 8:36 AM
> To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que le contenu ne présente aucun risque.
>
> Yeah. Valid concerns, Hussein.
>
> And I am happy to share some more information on that. I did not want to put all of it in the original email, but I see that it might be interesting for you and possibly others.
>
> I am closely following the numbers now. One of the reasons I am doing / proposing this now is that finally (after almost 3 years of waiting) we have access to some metrics that we can check. As of last week I got access to the ASF metrics (https://issues.apache.org/jira/browse/INFRA-25662).
>
> I have access to "organisation"-level information. Infra does not want to open it to everyone - not even to every member - but since I have been very active and helping with a number of things, I got the access granted as an exception. I also saw a small dashboard that INFRA is preparing to open to everyone once they sort out the access, where we will be able to see the "per-project" usage.
>
> Some stats that I can share (they asked not to share too much).
> From what I have looked at, I can tell that we - the whole ASF organisation - are right now safely below the total capacity, with a large margin - enough to handle spikes. But of course the growth of usage is there, and if uncontrolled we can again reach the same situation that triggered getting self-hosted runners a few years ago.
>
> Luckily, INFRA is getting it under control this time (and the metrics will help). In the last INFRA newsletter they announced some limitations that will apply to the projects (effective as of the end of April) - so once those are followed, we should be "safe" from being impacted by others (i.e. the noisy-neighbour effect). Some of the projects (not Airflow!) were exceeding those so far and they will be capped - they will need to optimize their builds eventually.
>
> Those are the rules:
>
> * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
> * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours).
> * Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations.
>
> Those numbers on their own do not tell much, but we can easily see what they mean when we put them side-by-side with "our" current numbers.
>
> * Currently - with all the "public" usage - we are at 8 full-time runners.
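As a quick sanity check of the quoted limits, the "full-time runner" equivalences work out almost exactly: 25 runners busy around the clock for a calendar week is 252,000 minutes, so the newsletter's 250,000 figure appears to be a rounding, while the hour figures and the five-day numbers match exactly.

```python
# Sanity-check the "full-time runner" equivalences quoted in the INFRA rules.
MINUTES_PER_DAY = 24 * 60


def runner_minutes(runners: int, days: int) -> int:
    """Runner-minutes consumed by `runners` machines busy around the clock for `days` days."""
    return runners * days * MINUTES_PER_DAY


weekly_cap = runner_minutes(25, 7)       # weekly cap: 25 full-time runners
five_day_cap = runner_minutes(30, 5)     # rolling cap: 30 full-time runners over 5 days

print(weekly_cap, weekly_cap // 60)      # 252000 minutes, 4200 hours
print(five_day_cap, five_day_cap // 60)  # 216000 minutes, 3600 hours
```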
> This is after some of the changes I've done. With the recent changes I have already moved a lot of the non-essential build components that do not require a lot of parallelism to public runners.
> * The 20/15 jobs limit is a bit artificial (not really enforceable on the workflow level) - but in our case, as I optimized most PRs to run just a subset of the tests, the average will be way below that. No matter whether you are a committer or not, regular PRs run a far smaller subset of the jobs than a full "canary" build. And for the canary builds we should stay - at least for now - with self-hosted runners.
>
> Some back-of-the-envelope calculations of what might happen when we switch to "public" for everyone:
>
> Unfortunately, until we enable the experiment, I do not have an easy way to distinguish the "canary" from the "committer" runs, so these are educated guesses. But our self-hosted build time vs. public build time is ~20% more for self-hosted (100,000 minutes vs. 80,000 minutes this month) - see the attached screenshot for the current month. As you can see, building images has already been moved to public runners for everyone as of two weeks ago or so, so that will not change.
>
> Taking into account that the self-hosted ones are ~1.7x faster, this means that currently we have ~2x more self-hosted time used than public. We can assume that 50% of that is committer PRs and "canary" builds are the second half (this sounds safe, because canary builds use way more resources, even if committers run many more PRs than we have merges).
> So by moving committer builds to public runners, we will likely increase our public time 2x (from 8 FT runners to 16 FT runners) - way below the 25 FT runners that is the "cap" from INFRA. Even if we move all the canary builds there, we should be at most at ~24 FT runners, which is still below the limits but would be dangerously close to it.
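The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the reasoning in the email: the 50/50 committer/canary split and the rounding of the ~2.1x ratio down to "2x" are assumptions stated in the text, not measurements.

```python
# Reproduce the back-of-the-envelope estimate from the email.
self_hosted_minutes = 100_000  # self-hosted minutes this month (per the screenshot)
public_minutes = 80_000        # public-runner minutes this month
speedup = 1.7                  # self-hosted runners are ~1.7x faster

# Express the self-hosted work in public-runner-equivalent minutes.
ratio = self_hosted_minutes * speedup / public_minutes
print(round(ratio, 2))         # 2.12, which the email rounds to "~2x more"

current_public_ft = 8          # current public usage, in full-time (FT) runners
committer_share = 0.5          # assumption: half the self-hosted time is committer PRs

# Moving committer PRs roughly doubles public usage; moving canary builds
# too roughly triples it, using the rounded 2x ratio from the email.
after_committers = current_public_ft + current_public_ft * 2 * committer_share
after_everything = current_public_ft + current_public_ft * 2
print(after_committers, after_everything)  # 16 and 24 FT, vs. the 25 FT weekly cap
```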
> That's why I want to keep the canary builds self-hosted until we can get some clarity on the impact of moving the "PR" builds.
>
> We will see the final numbers when we move, but I think we are pretty safely within the limits.
>
> J.
>
>
> On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:
> Although 900 runners seem like a lot, they are shared among the Apache organization's 2.2k repositories. Of course only a few of them are active (let's say 50), and some of them use an external CI tool for big jobs (e.g. Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other very active repositories based entirely on GHA - for example Iceberg, Spark, Superset, ...
>
> I haven't found the ASF runner metrics dashboard to check the max concurrency and the max queued time during peak hours, but I'm sure that moving Airflow committers' CI jobs to public runners will put some pressure on those runners, especially since these committers are the most active contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64 GB RAM) are used almost all the time - so we can say that we will need around 70 ASF runners to run the same jobs.
>
> There is no harm in testing and deciding after 2-3 weeks.
>
> We also need to find a way to let the infra team help us solve the connectivity problem with the ARC runners <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>
> +1 for testing what you propose.
>
> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>
> > +1 I like the idea.
> > Looking forward to seeing the difference.
> > Thanks & Regards,
> > Amogh Desai
> >
> > On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis <ferru...@amazon.com.invalid> wrote:
> >
> > > Interested in seeing the difference, +1
> > >
> > > - ferruzzi
> > >
> > > ________________________________
> > > From: Oliveira, Niko <oniko...@amazon.com.INVALID>
> > > Sent: Thursday, April 4, 2024 2:00 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
> > >
> > > +1, I'd love to see this as well.
> > >
> > > In the past, stability and long queue times of PR builds have been very frustrating. I'm not 100% sure this is due to using self-hosted runners, since a queue depth of 35 (to my mind) should be plenty. But something about that setup has never seemed quite right to me with queuing. Switching to public runners for a while to experiment would be great, to see if it improves.
> > >
> > > ________________________________
> > > From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
> > > Sent: Thursday, April 4, 2024 12:41:02 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
> > >
> > > +1 from me to this idea.
> > >
> > > Sounds very reasonable to me. At times, my experience has been better with public runners than with self-hosted runners :)
> > >
> > > And as already mentioned in the discussion, I think having the ability to apply the "use-self-hosted-runners" label for critical times would be nice to have too.
> > >
> > > On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > TL;DR: With some recent changes in GitHub Actions and the fact that the ASF has a lot of runners donated for all the builds, I think we could experiment with disabling "self-hosted" runners for committer builds.
> > > >
> > > > These self-hosted runners of ours have been extremely helpful (and we should again thank Amazon and Astronomer for donating credits / money for those) when the GitHub public runners were far less powerful and fewer of them were available for ASF projects. This saved us a LOT of trouble when there was contention between ASF projects.
> > > > But as of recently, both limitations have been largely removed:
> > > >
> > > > * The ASF has 900 public runners donated by GitHub to all projects.
> > > > * Those public runners (as of January) now have, for open-source projects, 4 CPUs and 16 GB of memory - https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
> > > >
> > > > While they are not as powerful as our self-hosted runners, the parallelism we utilise for those brings the builds into not-that-bad shape compared to the self-hosted runners. Typical differences between the public and self-hosted runners, for the complete set of tests, are now ~20 minutes for public runners and ~14 minutes for self-hosted ones.
> > > >
> > > > But this is not the only factor - I think committers experience "Job failed" on self-hosted runners generally much more often than non-committers do (the stability of our solution is not the best, and we are also using cheaper spot instances). Plus, we limit the total number of self-hosted runners (35) - so if several committers submit a few PRs while we have a canary build running, the jobs will wait until runners are available.
> > > >
> > > > And of course it costs the credits / money of sponsors, which we could use for other things.
> > > >
> > > > I have - as of recently - access to GitHub Actions metrics, and while the ASF is keeping an eye on things and has started limiting the number of parallel jobs that workflows in projects can run, it looks like even if all committer runs are added to the public runners, we will still cause far lower usage than the limits - and far lower than some other projects (which I will not name here). I have access to the metrics, so I can monitor our usage and react.
> > > > I think that - if we switch committers to "public" runners by default - the experience will possibly not be much worse for them (and sometimes even better, because of the stability / limited queue).
> > > >
> > > > I was planning this carefully - I made a number of refactors/changes to our workflows recently that make it way easier to manipulate the configuration and get various conditions applied to various jobs - so changing/experimenting with those settings should be - well - a breeze :). A few recent changes have proven that this change and workflow refactor were definitely worth the effort. I feel like I finally got control over it, where previously it was a bit like herding a pack of cats (which I brought to life myself, but that's another story).
> > > >
> > > > I would like to propose to run an experiment and see how it works if we switch committer PRs back to the public runners - leaving the self-hosted runners only for canary builds (which makes perfect sense, because those builds run a full set of tests and we need as much speed and power there as we can get).
> > > >
> > > > This is pretty safe; we should be able to switch back very easily if we see problems. I will also monitor it and see if our usage stays within the limits of the ASF. I can also add a feature so that committers are able to use self-hosted runners by applying the "use self-hosted runners" label to a PR.
> > > >
> > > > Running it for 2-3 weeks should be enough to gather experience from committers - whether things seem better or worse for them, or maybe they won't really notice a big difference.
> > > > Later we could consider some next steps: disabling the self-hosted runners for canary builds if we see that our usage is low and the builds are fast enough, and eventually possibly removing the current self-hosted runners and switching to a better k8s-based infrastructure (which we are close to doing, but it is a bit difficult while the current self-hosted solution is so critical to keep running - like rebuilding the plane while it is flying). I'd love to do it gradually, in a "change slowly and observe" mode - especially now that I have access to "proper" metrics.
> > > >
> > > > WDYT?
> > > >
> > > > J.