Seeing no big "no's" - I will prepare and run the experiment, starting some time next week after we get 2.9.0 out - I do not want to break anything there. In the meantime, a preparatory PR to add the "use self-hosted runners" label is out: https://github.com/apache/airflow/pull/38779
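For context, label-driven runner selection in GitHub Actions generally looks like the sketch below. This is a hypothetical fragment, not the content of the linked PR - the job names and the fallback runner label are assumptions; only the "use self-hosted runners" label name comes from the thread.

```yaml
# Hypothetical sketch, not the actual apache/airflow workflow: a cheap
# selector job inspects the PR labels and exposes the runner pool as an
# output for downstream jobs to consume.
jobs:
  select-runner:
    runs-on: ubuntu-22.04
    outputs:
      runs-on: ${{ steps.decide.outputs.runs-on }}
    steps:
      - id: decide
        run: |
          if [[ "${{ contains(github.event.pull_request.labels.*.name, 'use self-hosted runners') }}" == "true" ]]; then
            echo "runs-on=self-hosted" >> "$GITHUB_OUTPUT"
          else
            echo "runs-on=ubuntu-22.04" >> "$GITHUB_OUTPUT"
          fi
  tests:
    needs: select-runner
    # The expression is resolved before the job is scheduled, so the whole
    # test job lands on whichever pool the selector chose.
    runs-on: ${{ needs.select-runner.outputs.runs-on }}
    steps:
      - run: echo "Running on the selected runner pool"
```

Keeping the decision in a tiny always-public job means the label is honored even when the self-hosted pool is saturated.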
On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar <rbish...@amazon.com.invalid> wrote:

> +1 with trying this out. I agree with keeping the canary builds self-hosted in order to validate the usage for the PRs.
>
> -- Rajesh
>
>
> From: Jarek Potiuk <ja...@potiuk.com>
> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> Date: Friday, April 5, 2024 at 8:36 AM
> To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que le contenu ne présente aucun risque.
>
> Yeah. Valid concerns, Hussein.
>
> And I am happy to share some more information on that. I did not want to put all of it in the original email, but I see that it might be interesting for you and possibly others.
>
> I am closely following the numbers now. One of the reasons I am doing / proposing this now is that finally (after almost 3 years of waiting) we have access to some metrics that we can check. As of last week I got access to the ASF metrics (https://issues.apache.org/jira/browse/INFRA-25662).
>
> I have access to "organisation"-level information. Infra does not want to open it to everyone - not even to every member - but since I have been very active and helping with a number of things, I got the access granted as an exception. I also saw a small dashboard that INFRA is preparing to open to everyone once they sort out the access, where we will be able to see the "per-project" usage.
>
> Some stats that I can share (they asked not to share too much).
> From what I have looked at, I can tell that we - the whole ASF organisation - are right now safely below the total capacity, with a large margin - enough to handle spikes. But of course the growth of usage is there, and if uncontrolled we can again reach the same situation that triggered getting self-hosted runners a few years ago.
>
> Luckily, INFRA is getting it under control this time (and the metrics will help). In the last INFRA newsletter they announced some limitations that will apply to the projects (effective as of the end of April) - so once those are followed, we should be "safe" from being impacted by others (i.e. the noisy-neighbour effect). Some of the projects (not Airflow!) were exceeding those so far and they will be capped - they will need to optimize their builds eventually.
>
> Those are the rules:
>
> * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
> * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours).
> * Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations.
>
> Those numbers on their own do not tell much, but we can easily see what they mean when we put them side-by-side with "our" current numbers.
>
> * Currently - with all the "public" usage - we are at 8 full-time runners.
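As a quick sanity check of the quoted limits, the "full-time runner" equivalences work out almost exactly: 25 runners busy around the clock for a calendar week is 252,000 minutes, so the newsletter's 250,000 figure appears to be a rounding, while the hour figures and the five-day numbers match exactly.

```python
# Sanity-check the "full-time runner" equivalences quoted in the INFRA rules.
MINUTES_PER_DAY = 24 * 60


def runner_minutes(runners: int, days: int) -> int:
    """Runner-minutes consumed by `runners` machines busy around the clock for `days` days."""
    return runners * days * MINUTES_PER_DAY


weekly_cap = runner_minutes(25, 7)       # weekly cap: 25 full-time runners
five_day_cap = runner_minutes(30, 5)     # rolling cap: 30 full-time runners over 5 days

print(weekly_cap, weekly_cap // 60)      # 252000 minutes, 4200 hours
print(five_day_cap, five_day_cap // 60)  # 216000 minutes, 3600 hours
```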
> This is after some of the changes I've done. With the recent changes I have already moved a lot of the non-essential build components that do not require a lot of parallelism to public runners.
> * The 20/15 jobs limit is a bit artificial (not really enforceable on the workflow level) - but in our case, as I optimized most PRs to run just a subset of the tests, the average will be way below that. No matter whether you are a committer or not, regular PRs run a far smaller subset of the jobs than a full "canary" build. And for the canary builds we should stay - at least for now - with self-hosted runners.
>
> Some back-of-the-envelope calculations of what might happen when we switch to "public" for everyone:
>
> Unfortunately, until we enable the experiment, I do not have an easy way to distinguish the "canary" from the "committer" runs, so these are educated guesses. But our self-hosted build time vs. public build time is ~20% more for self-hosted (100,000 minutes vs. 80,000 minutes this month) - see the attached screenshot for the current month. As you can see, building images has already been moved to public runners for everyone as of two weeks ago or so, so that will not change.
>
> Taking into account that the self-hosted ones are ~1.7x faster, this means that currently we have ~2x more self-hosted time used than public. We can assume that 50% of that is committer PRs and "canary" builds are the second half (this sounds safe, because canary builds use way more resources, even if committers run many more PRs than we have merges).
> So by moving committer builds to public runners, we will likely increase our public time 2x (from 8 FT runners to 16 FT runners) - way below the 25 FT runners that is the "cap" from INFRA. Even if we move all the canary builds there, we should be at most at ~24 FT runners, which is still below the limits but would be dangerously close to it.
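The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the reasoning in the email: the 50/50 committer/canary split and the rounding of the ~2.1x ratio down to "2x" are assumptions stated in the text, not measurements.

```python
# Reproduce the back-of-the-envelope estimate from the email.
self_hosted_minutes = 100_000  # self-hosted minutes this month (per the screenshot)
public_minutes = 80_000        # public-runner minutes this month
speedup = 1.7                  # self-hosted runners are ~1.7x faster

# Express the self-hosted work in public-runner-equivalent minutes.
ratio = self_hosted_minutes * speedup / public_minutes
print(round(ratio, 2))         # 2.12, which the email rounds to "~2x more"

current_public_ft = 8          # current public usage, in full-time (FT) runners
committer_share = 0.5          # assumption: half the self-hosted time is committer PRs

# Moving committer PRs roughly doubles public usage; moving canary builds
# too roughly triples it, using the rounded 2x ratio from the email.
after_committers = current_public_ft + current_public_ft * 2 * committer_share
after_everything = current_public_ft + current_public_ft * 2
print(after_committers, after_everything)  # 16 and 24 FT, vs. the 25 FT weekly cap
```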
> That's why I want to keep the canary builds self-hosted until we can get some clarity on the impact of moving the "PR" builds.
>
> We will see the final numbers when we move, but I think we are pretty safely within the limits.
>
> J.
>
>
> On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:
> Although 900 runners seem like a lot, they are shared among the Apache organization's 2.2k repositories. Of course only a few of them are active (let's say 50), and some of them use an external CI tool for big jobs (e.g. Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other very active repositories based entirely on GHA - for example Iceberg, Spark, Superset, ...
>
> I haven't found the ASF runner metrics dashboard to check the max concurrency and the max queued time during peak hours, but I'm sure that moving Airflow committers' CI jobs to public runners will put some pressure on those runners, especially since these committers are the most active contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64 GB RAM) are used almost all the time - so we can say that we will need around 70 ASF runners to run the same jobs.
>
> There is no harm in testing and deciding after 2-3 weeks.
>
> We also need to find a way to let the infra team help us solve the connectivity problem with the ARC runners <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>
> +1 for testing what you propose.
>
> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>
> > +1 I like the idea.
> > Looking forward to seeing the difference.
> > Thanks & Regards,
> > Amogh Desai
> >
> > On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis <ferru...@amazon.com.invalid> wrote:
> >
> > > Interested in seeing the difference, +1
> > >
> > > - ferruzzi
> > >
> > > ________________________________
> > > From: Oliveira, Niko <oniko...@amazon.com.INVALID>
> > > Sent: Thursday, April 4, 2024 2:00 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
> > >
> > > +1, I'd love to see this as well.
> > >
> > > In the past, stability and long queue times of PR builds have been very frustrating. I'm not 100% sure this is due to using self-hosted runners, since a queue depth of 35 (to my mind) should be plenty. But something about that setup has never seemed quite right to me with queuing. Switching to public runners for a while to experiment would be great, to see if it improves.
> > >
> > > ________________________________
> > > From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
> > > Sent: Thursday, April 4, 2024 12:41:02 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for committer PRs
> > >
> > > +1 from me to this idea.
> > >
> > > Sounds very reasonable to me. At times, my experience has been better with public runners than with self-hosted runners :)
> > >
> > > And as already mentioned in the discussion, I think having the ability to apply the "use-self-hosted-runners" label for critical times would be nice to have too.
> > >
> > > On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > TL;DR: With some recent changes in GitHub Actions and the fact that the ASF has a lot of runners donated for all the builds, I think we could experiment with disabling "self-hosted" runners for committer builds.
> > > >
> > > > These self-hosted runners of ours have been extremely helpful (and we should again thank Amazon and Astronomer for donating credits / money for those) when the GitHub public runners were far less powerful and fewer of them were available for ASF projects. This saved us a LOT of trouble when there was contention between ASF projects.
> > > > But as of recently, both limitations have been largely removed:
> > > >
> > > > * The ASF has 900 public runners donated by GitHub to all projects.
> > > > * Those public runners (as of January) now have, for open-source projects, 4 CPUs and 16 GB of memory - https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
> > > >
> > > > While they are not as powerful as our self-hosted runners, the parallelism we utilise for those brings the builds into not-that-bad shape compared to the self-hosted runners. Typical differences between the public and self-hosted runners, for the complete set of tests, are now ~20 minutes for public runners and ~14 minutes for self-hosted ones.
> > > >
> > > > But this is not the only factor - I think committers experience "Job failed" on self-hosted runners generally much more often than non-committers do (the stability of our solution is not the best, and we are also using cheaper spot instances). Plus, we limit the total number of self-hosted runners (35) - so if several committers submit a few PRs while we have a canary build running, the jobs will wait until runners are available.
> > > >
> > > > And of course it costs the credits / money of sponsors, which we could use for other things.
> > > >
> > > > I have - as of recently - access to GitHub Actions metrics, and while the ASF is keeping an eye on things and has started limiting the number of parallel jobs that workflows in projects can run, it looks like even if all committer runs are added to the public runners, we will still cause far lower usage than the limits - and far lower than some other projects (which I will not name here). I have access to the metrics, so I can monitor our usage and react.
> > > > I think that - if we switch committers to "public" runners by default - the experience will possibly not be much worse for them (and sometimes even better, because of the stability / limited queue).
> > > >
> > > > I was planning this carefully - I made a number of refactors/changes to our workflows recently that make it way easier to manipulate the configuration and get various conditions applied to various jobs - so changing/experimenting with those settings should be - well - a breeze :). A few recent changes have proven that this change and workflow refactor were definitely worth the effort. I feel like I finally got control over it, where previously it was a bit like herding a pack of cats (which I brought to life myself, but that's another story).
> > > >
> > > > I would like to propose to run an experiment and see how it works if we switch committer PRs back to the public runners - leaving the self-hosted runners only for canary builds (which makes perfect sense, because those builds run a full set of tests and we need as much speed and power there as we can get).
> > > >
> > > > This is pretty safe; we should be able to switch back very easily if we see problems. I will also monitor it and see if our usage stays within the limits of the ASF. I can also add a feature so that committers are able to use self-hosted runners by applying the "use self-hosted runners" label to a PR.
> > > >
> > > > Running it for 2-3 weeks should be enough to gather experience from committers - whether things seem better or worse for them, or maybe they won't really notice a big difference.
> > > > Later we could consider some next steps: disabling the self-hosted runners for canary builds if we see that our usage is low and the builds are fast enough, and eventually possibly removing the current self-hosted runners and switching to a better k8s-based infrastructure (which we are close to doing, but it is a bit difficult while the current self-hosted solution is so critical to keep running - like rebuilding the plane while it is flying). I'd love to do it gradually, in a "change slowly and observe" mode - especially now that I have access to "proper" metrics.
> > > >
> > > > WDYT?
> > > >
> > > > J.