Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

Jarek Potiuk Thu, 18 Apr 2024 05:11:27 -0700

The change is merged. Rebasing should trigger maintainers' PRs on public
runners. Maintainers should be able to switch back to self-hosted runners by
applying the "use self-hosted runners" label. The `main` and `v2-9-test` runs
should still use self-hosted runners.
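
A minimal sketch of that runner selection, assuming the rules are exactly as described above. This is not the actual workflow code from the PR; the function name, the event strings, the exact label text and the fallback for other branches are assumptions made for illustration.

```python
# Hypothetical names throughout; only the rules themselves come from the message.
SELF_HOSTED_LABEL = "use self-hosted runners"   # assumed exact label text
SELF_HOSTED_BRANCHES = {"main", "v2-9-test"}    # branch runs that stay self-hosted


def select_runner(event: str, branch: str, labels: set, is_maintainer: bool) -> str:
    """Return "self-hosted" or "public" for a single CI run."""
    if event != "pull_request":
        # Runs on `main` and `v2-9-test` keep using self-hosted runners.
        # Other branches falling back to public is an assumption.
        return "self-hosted" if branch in SELF_HOSTED_BRANCHES else "public"
    # PRs default to public runners; a maintainer can opt in with the label.
    if is_maintainer and SELF_HOSTED_LABEL in labels:
        return "self-hosted"
    return "public"


print(select_runner("pull_request", "main", set(), True))
print(select_runner("pull_request", "main", {SELF_HOSTED_LABEL}, True))
print(select_runner("push", "main", set(), True))
```

The label check sits after the event check so that the label can only ever affect PR runs.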


I would love to hear back from the maintainers on whether that improves their
experience.

On Thu, Apr 18, 2024 at 10:59 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> PR switching it here: https://github.com/apache/airflow/pull/39106 -
> sorry for the delay in following up on that one.
>
> J.
>
> On Fri, Apr 5, 2024 at 6:08 PM Wei Lee <weilee...@gmail.com> wrote:
>
>> +1 for this. I have not yet had much chance to experience many job
>> failures, but it won’t harm us to test this out. Plus, it saves some of the
>> cost.
>>
>> Best,
>> Wei
>>
>> > On Apr 5, 2024, at 11:36 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Seeing no big "no's" - I will prepare and run the experiment - starting
>> > some time next week, after we get 2.9.0 out - I do not want to break
>> > anything there. In the meantime, a preparatory PR to add the "use
>> > self-hosted runners" label is out: https://github.com/apache/airflow/pull/38779
>> >
>> > On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar
>> > <rbish...@amazon.com.invalid> wrote:
>> >
>> >> +1 with trying this out. I agree with keeping the canary builds
>> >> self-hosted in order to validate the usage for the PRs.
>> >>
>> >> -- Rajesh
>> >>
>> >>
>> >> From: Jarek Potiuk <ja...@potiuk.com>
>> >> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> Date: Friday, April 5, 2024 at 8:36 AM
>> >> To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>> >> self-hosted runners for commiter PRs
>> >>
>> >>
>> >> CAUTION: This email originated from outside of the organization. Do not
>> >> click links or open attachments unless you can confirm the sender and
>> >> know the content is safe.
>> >>
>> >>
>> >>
>> >> Yeah. Valid concerns Hussein.
>> >>
>> >> And I am happy to share some more information on that. I did not want to
>> >> put all of that in the original email, but I see that might be interesting
>> >> for you and possibly others.
>> >>
>> >> I am closely following the numbers now. One of the reasons I am doing /
>> >> proposing it now is that (after almost 3 years of waiting) we finally
>> >> have access to some metrics that we can check. As of last week I got
>> >> access to the ASF metrics (
>> >> https://issues.apache.org/jira/browse/INFRA-25662).
>> >>
>> >> I have access to "organisation" level information. Infra does not want to
>> >> open it to everyone - even to every member - but since I got very active
>> >> and have been helping with a number of things, I got the access granted as
>> >> an exception. Also I saw a small dashboard that INFRA is preparing to open
>> >> to everyone once they sort out the access, where we will be able to see
>> >> the "per-project" usage.
>> >>
>> >> Some stats that I can share (they asked not to share too much).
>> >>
>> >> From what I looked at I can tell that we are right now (the whole ASF
>> >> organisation) safely below the total capacity. With a large margin -
>> >> enough to handle spikes, but of course the growth of usage is there and
>> >> if uncontrolled - we can again reach the same situation that triggered
>> >> getting self-hosted runners a few years ago.
>> >>
>> >> Luckily - INFRA gets it under control this time (and metrics will help).
>> >> In the last INFRA newsletter, they announced some limitations that will
>> >> apply to the projects (effective as of end of April) - so once those are
>> >> followed, we should be "safe" from being impacted by others (i.e. the
>> >> noisy-neighbour effect). Some of the projects (not Airflow (!) ) were
>> >> exceeding those so far and they will be capped - they will need to
>> >> optimize their builds eventually.
>> >>
>> >> Those are the rules:
>> >>
>> >> * All workflows MUST have a job concurrency level less than or equal to
>> >> 20. This means a workflow cannot have more than 20 jobs running at the
>> >> same time across all matrices.
>> >> * All workflows SHOULD have a job concurrency level less than or equal
>> >> to 15. Just because 20 is the max, doesn't mean you should strive for 20.
>> >> * The average number of minutes a project uses per calendar week MUST
>> >> NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
>> >> 4,200 hours).
>> >> * The average number of minutes a project uses in any consecutive
>> >> five-day period MUST NOT exceed the equivalent of 30 full-time runners
>> >> (216,000 minutes, or 3,600 hours).
>> >> * Projects whose builds consistently cross the maximum use limits will
>> >> lose their access to GitHub Actions until they fix their build
>> >> configurations.
>> >>
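
The minute and hour figures in the rules above can be sanity-checked with a few lines of Python. This is only a sketch: it assumes a "full-time runner" means one runner busy 24 hours a day, and the function name is made up for illustration. Under that reading the weekly cap works out to 252,000 minutes, which the rule appears to round to 250,000; the five-day figures match exactly.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def full_time_runner_budget(runners: int, days: int) -> tuple:
    """Return (minutes, hours) used by `runners` runners busy 24/7 for `days` days."""
    hours = runners * days * HOURS_PER_DAY
    return hours * MINUTES_PER_HOUR, hours


# Weekly cap: 25 full-time runners over a 7-day calendar week.
weekly_minutes, weekly_hours = full_time_runner_budget(runners=25, days=7)
# Five-day cap: 30 full-time runners over 5 consecutive days.
five_day_minutes, five_day_hours = full_time_runner_budget(runners=30, days=5)

print(weekly_minutes, weekly_hours)
print(five_day_minutes, five_day_hours)
```
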
>> >> Those numbers on their own do not tell much, but we can easily see what
>> >> they mean when we put them side-by-side with "our" current numbers.
>> >>
>> >> * Currently - with all the "public" usage we are at 8 full-time runners.
>> >> This is after some of the changes I've made: with the recent changes I
>> >> already moved a lot of the non-essential build components that do not
>> >> require a lot of parallelism to public runners.
>> >> * The 20/15 jobs limit is a bit artificial (not really enforceable on
>> >> workflow level) - but in our case, as I optimized most PRs to run just a
>> >> subset of the tests, the average will be way below that - no matter if
>> >> you are a committer or not, regular PRs are a far smaller subset of the
>> >> jobs than a full "canary" build. And for canary builds we should stay -
>> >> at least for now - with self-hosted runners.
>> >>
>> >> Some back-of-the-envelope calculations of what might happen when we
>> >> switch to "public" for everyone:
>> >>
>> >> Unfortunately, until we enable the experiment, I do not have an easy way
>> >> to distinguish the "canary" from "committer" runs so those are a bit of
>> >> a guess. But our self-hosted build time vs. public build time is ~ 20%
>> >> more for self-hosted (100,000 minutes vs. 80,000 minutes this month) -
>> >> see the attached screenshot for the current month.
>> >> As you can see - image builds were already moved to public runners for
>> >> everyone about two weeks ago, so that will not change.
>> >>
>> >> Taking into account that self-hosted ones are ~ 1.7x faster, this means
>> >> that, measured in public-runner time, we currently use ~ 2x more
>> >> self-hosted time than public. We can assume that 50% of that are
>> >> committer PRs and "Canary" builds are the second half (sounds safe
>> >> because canary builds use way more resources, even if committers run
>> >> many more PRs than merges).
>> >> So by moving committer builds to public runners, we will - likely -
>> >> increase our public time 2x (from 8 FT runners to 16 FT runners) - way
>> >> below the 25 FT runners that is the "cap" from INFRA. Even if we move all
>> >> Canary builds there, we should be at most at ~24 FTs, which is still
>> >> below the limits, but would be dangerously close to them. That's why I
>> >> want to keep canary builds as self-hosted until we can get some clarity
>> >> on the impact of moving the "PR" builds.
>> >>
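
The estimate above can be written out as a small calculation. This is a rough sketch that uses only the figures quoted in the message; the function names are made up for illustration, and the rounding of the ratio to 2x follows the message rather than the unrounded value (about 2.1).

```python
def self_hosted_to_public_ratio(self_hosted_minutes: float,
                                public_minutes: float,
                                speedup: float) -> float:
    """Self-hosted usage expressed in public-runner minutes, relative to public usage."""
    return self_hosted_minutes * speedup / public_minutes


def projected_public_runners(current_ft: float, ratio: float, moved_share: float) -> float:
    """Full-time public runners needed once `moved_share` of self-hosted work moves over."""
    return current_ft * (1 + ratio * moved_share)


# 100,000 self-hosted vs. 80,000 public minutes, self-hosted ~1.7x faster.
ratio = self_hosted_to_public_ratio(100_000, 80_000, 1.7)

# The message rounds the ratio to ~2x and assumes half of it is committer PRs.
committer_prs_only = projected_public_runners(8, ratio=2.0, moved_share=0.5)
prs_and_canary = projected_public_runners(8, ratio=2.0, moved_share=1.0)

print(round(ratio, 2), committer_prs_only, prs_and_canary)
```

With the rounded ratio this gives the 16 and 24 full-time runners quoted above, against the cap of 25.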
>> >> We will see the final numbers when we move, but I think we are pretty safe
>> >> within the limits.
>> >>
>> >> J.
>> >>
>> >>
>> >> On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:
>> >> Although 900 runners seem like a lot, they are shared among the Apache
>> >> organization's 2.2k repositories. Of course only a few of them are active
>> >> (let's say 50), and some of them use an external CI tool for big jobs
>> >> (e.g. Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other
>> >> very active repositories based entirely on GHA, for example, Iceberg,
>> >> Spark, Superset, ...
>> >>
>> >> I haven't found the ASF runners metrics dashboard to check the max
>> >> concurrency and the max queued time during peak hours, but I'm sure that
>> >> moving Airflow committers' CI jobs to public runners will put some
>> >> pressure on these runners, especially since these committers are the most
>> >> active contributors to Airflow, and the 35 self-hosted runners (with 8
>> >> CPUs and 64 GB RAM) are used almost all the time, so we can say that we
>> >> will need around 70 ASF runners to run the same jobs.
>> >>
>> >> There is no harm in testing and deciding after 2-3 weeks.
>> >>
>> >> We also need to find a way to let the infra team help us solve the
>> >> connectivity problem with the ARC runners
>> >> <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>> >>
>> >> +1 for testing what you propose.
>> >>
>> >> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com>
>> >> wrote:
>> >>
>> >>> +1 I like the idea.
>> >>> Looking forward to seeing the difference.
>> >>>
>> >>> Thanks & Regards,
>> >>> Amogh Desai
>> >>>
>> >>>
>> >>> On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis
>> >>> <ferru...@amazon.com.invalid>
>> >>> wrote:
>> >>>
>> >>>> Interested in seeing the difference, +1
>> >>>>
>> >>>>
>> >>>> - ferruzzi
>> >>>>
>> >>>>
>> >>>> ________________________________
>> >>>> From: Oliveira, Niko <oniko...@amazon.com.INVALID>
>> >>>> Sent: Thursday, April 4, 2024 2:00 PM
>> >>>> To: dev@airflow.apache.org
>> >>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>> >>>> self-hosted runners for commiter PRs
>> >>>>
>> >>>>
>> >>>> +1 I'd love to see this as well.
>> >>>>
>> >>>> In the past, stability and long queue times of PR builds have been very
>> >>>> frustrating. I'm not 100% sure this is due to using self-hosted runners,
>> >>>> since 35 queue depth (to my mind) should be plenty. But something about
>> >>>> that setup has never seemed quite right to me with queuing. Switching to
>> >>>> public runners for a while to experiment would be great to see if it
>> >>>> improves.
>> >>>>
>> >>>> ________________________________
>> >>>> From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
>> >>>> Sent: Thursday, April 4, 2024 12:41:02 PM
>> >>>> To: dev@airflow.apache.org
>> >>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>> >>>> self-hosted runners for commiter PRs
>> >>>>
>> >>>>
>> >>>> +1 from me to this idea.
>> >>>>
>> >>>> Sounds very reasonable to me.
>> >>>> At times, my experience has been better with public runners instead of
>> >>>> self-hosted runners :)
>> >>>>
>> >>>> And like already mentioned in the discussion, I think having the ability
>> >>>> to apply the label "use-self-hosted-runners" at critical times would be
>> >>>> nice to have too.
>> >>>>
>> >>>>
>> >>>> On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
>> >>>>
>> >>>>> Hello everyone,
>> >>>>>
>> >>>>> TL;DR With some recent changes in GitHub Actions and the fact that ASF
>> >>>>> has a lot of runners available donated for all the builds, I think we
>> >>>>> could experiment with disabling "self-hosted" runners for committer
>> >>>>> builds.
>> >>>>>
>> >>>>> The self-hosted runners of ours have been extremely helpful (and we
>> >>>>> should again thank Amazon and Astronomer for donating credits / money
>> >>>>> for those) - when the GitHub public runners were far less powerful -
>> >>>>> and we had fewer of those available for ASF projects. This saved us a
>> >>>>> LOT of trouble when there was contention between ASF projects.
>> >>>>>
>> >>>>> But as of recently both limitations have been largely removed:
>> >>>>>
>> >>>>> * ASF has 900 public runners donated by GitHub to all projects
>> >>>>> * Those public runners (as of January) now have 4 CPUs and 16GB of
>> >>>>> memory for open-source projects -
>> >>>>> https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>> >>>>>
>> >>>>>
>> >>>>> While they are not as powerful as our self-hosted runners, the
>> >>>>> parallelism we utilise for those puts those builds in not-that-bad
>> >>>>> shape compared to self-hosted runners. Typical times now for the
>> >>>>> complete set of tests are ~ 20m for public runners and ~ 14m for
>> >>>>> self-hosted ones.
>> >>>>>
>> >>>>> But this is not the only factor - I think committers experience the
>> >>>>> "Job failed" for self-hosted runners generally much more often than
>> >>>>> non-committers (the stability of our solution is not the best, also we
>> >>>>> are using cheaper spot instances). Plus - we limit the total number of
>> >>>>> self-hosted runners (35) - so if several committers submit a few PRs
>> >>>>> and we have a canary build running, the jobs will wait until runners
>> >>>>> are available.
>> >>>>>
>> >>>>> And of course it costs the credits/money of sponsors which we could use
>> >>>>> for other things.
>> >>>>>
>> >>>>> I have - as of recently - access to GitHub Actions metrics - and while
>> >>>>> ASF is keeping an eye on it and has started limiting the number of
>> >>>>> parallel jobs that workflows in projects can run, it looks like even if
>> >>>>> all committer runs are added to the public runners, we will still cause
>> >>>>> far lower usage than the limits allow and far lower than some other
>> >>>>> projects (which I will not name here). I have access to the metrics so
>> >>>>> I can monitor our usage and react.
>> >>>>>
>> >>>>> I think possibly - if we switch committers to "public" runners by
>> >>>>> default - the experience will not be much worse for them (and sometimes
>> >>>>> even better - because of stability/limited queue).
>> >>>>>
>> >>>>> I was planning this carefully - I made a number of refactors/changes to
>> >>>>> our workflows recently that make it way easier to manipulate the
>> >>>>> configuration and get various conditions applied to various jobs - so
>> >>>>> changing/experimenting with those settings should be - well - a breeze
>> >>>>> :). A few recent changes have proven that this change and workflow
>> >>>>> refactor were definitely worth the effort. I feel like I finally got
>> >>>>> control over it, where previously it was a bit like herding a pack of
>> >>>>> cats (which I brought to life by myself, but that's another story).
>> >>>>>
>> >>>>> I would like to propose to run an experiment and see how it works if we
>> >>>>> switch committer PRs back to the public runners - leaving the
>> >>>>> self-hosted runners only for canary builds (which makes perfect sense
>> >>>>> because those builds run a full set of tests and we need as much speed
>> >>>>> and power there as we can get).
>> >>>>>
>> >>>>> This is pretty safe. We should be able to switch back very easily if we
>> >>>>> see problems. I will also monitor it and see if our usage is within the
>> >>>>> limits of the ASF. I can also add the feature that committers should be
>> >>>>> able to use self-hosted runners by applying the "use self-hosted
>> >>>>> runners" label to a PR.
>> >>>>>
>> >>>>> Running it for 2-3 weeks should be enough to gather experience from
>> >>>>> committers - whether things will seem better or worse for them - or
>> >>>>> maybe they won't really notice a big difference.
>> >>>>>
>> >>>>> Later we could consider some next steps - disabling the self-hosted
>> >>>>> runners for canary builds if we see that our usage is low and builds are
>> >>>>> fast enough, eventually possibly removing the current self-hosted
>> >>>>> runners and switching to a better k8s-based infrastructure (which we
>> >>>>> are close to doing, but it is a bit difficult while the current
>> >>>>> self-hosted solution is so critical to keep running - like rebuilding
>> >>>>> the plane while it is flying).
>> >>>>> I'd love to do it gradually in the "change slowly and observe" mode -
>> >>>>> especially now that I have access to "proper" metrics.
>> >>>>>
>> >>>>> WDYT?
>> >>>>>
>> >>>>> J.
>> >>>>>
