The change is merged. After rebasing, maintainers' PRs should run on public runners. Maintainers should be able to switch a PR back to self-hosted runners by applying the "use self-hosted runners" label. The `main` and `v2-9-test` runs should still use self-hosted runners.
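The routing described above can be sketched as a small decision function. This is only an illustration, not the logic from the merged PR: the `Run` type, the `pick_runner` function, and the runner identifiers `ubuntu-22.04` and `self-hosted` are hypothetical, and the sketch assumes the label is honoured only on maintainers' PRs.

```python
from dataclasses import dataclass

# Label and branch names come from the message above; the runner
# identifiers are assumptions made for this sketch.
SELF_HOSTED_LABEL = "use self-hosted runners"
SELF_HOSTED_BRANCHES = frozenset({"main", "v2-9-test"})
PUBLIC_RUNNER = "ubuntu-22.04"      # assumed public runner name
SELF_HOSTED_RUNNER = "self-hosted"  # assumed self-hosted runner name


@dataclass(frozen=True)
class Run:
    """Hypothetical description of a single CI run."""
    is_pull_request: bool
    branch: str
    by_maintainer: bool = False
    labels: frozenset[str] = frozenset()


def pick_runner(run: Run) -> str:
    """Return the runner type a run would use under the described rules."""
    if not run.is_pull_request:
        # Branch runs on main / v2-9-test stay on self-hosted runners.
        if run.branch in SELF_HOSTED_BRANCHES:
            return SELF_HOSTED_RUNNER
        return PUBLIC_RUNNER
    # PRs default to public runners; a maintainer can opt back in via the label.
    if run.by_maintainer and SELF_HOSTED_LABEL in run.labels:
        return SELF_HOSTED_RUNNER
    return PUBLIC_RUNNER
```

For example, `pick_runner(Run(True, "main", by_maintainer=True))` yields the public runner, while adding the label to that same PR yields the self-hosted one.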
I would love to hear back from the maintainers on whether this improves their experience.

On Thu, Apr 18, 2024 at 10:59 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> PR switching it here: https://github.com/apache/airflow/pull/39106 - sorry for the delay in following up on that one.
>
> J.
>
> On Fri, Apr 5, 2024 at 6:08 PM Wei Lee <weilee...@gmail.com> wrote:
>
>> +1 for this. I have not yet had much chance to experience many job failures, but it won't harm us to test them out. Plus, it saves some of the cost.
>>
>> Best,
>> Wei
>>
>> > On Apr 5, 2024, at 11:36 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Seeing no big "no's" - I will prepare and run the experiment - starting some time next week, after we get 2.9.0 out - I do not want to break anything there. In the meantime, a preparatory PR to add the "use self-hosted runners" label is out: https://github.com/apache/airflow/pull/38779
>> >
>> > On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar <rbish...@amazon.com.invalid> wrote:
>> >
>> >> +1 with trying this out. I agree with keeping the canary builds self-hosted in order to validate the usage for the PRs.
>> >>
>> >> -- Rajesh
>> >>
>> >> From: Jarek Potiuk <ja...@potiuk.com>
>> >> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> Date: Friday, April 5, 2024 at 8:36 AM
>> >> To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for commiter PRs
>> >>
>> >> Yeah. Valid concerns, Hussein.
>> >>
>> >> And I am happy to share some more information on that. I did not want to put all of that in the original email, but I see that it might be interesting for you and possibly others.
>> >>
>> >> I am closely following the numbers now. One of the reasons I am doing / proposing it now is that (after almost 3 years of waiting) we finally have access to some metrics that we can check. As of last week I got access to the ASF metrics (https://issues.apache.org/jira/browse/INFRA-25662).
>> >>
>> >> I have access to "organisation" level information. Infra does not want to open it to everyone - even to every member - but since I got very active and have been helping with a number of things, I got the access granted as an exception. Also I saw a small dashboard that INFRA is preparing to open to everyone once they sort out the access, where we will be able to see the "per-project" usage.
>> >>
>> >> Some stats that I can share (they asked not to share too much).
>> >>
>> >> From what I looked at I can tell that we (the whole ASF organisation) are right now safely below the total capacity, with a large margin - enough to handle spikes. But of course the growth of usage is there and, if uncontrolled, we can again reach the same situation that triggered getting self-hosted runners a few years ago.
>> >>
>> >> Luckily, INFRA is getting it under control this time (and metrics will help). In the last INFRA newsletter, they announced some limitations that will apply to the projects (effective as of end of April) - so once those are followed, we should be "safe" from being impacted by others (i.e. the noisy-neighbour effect). Some of the projects (not Airflow (!)) were exceeding those so far and they will be capped - they will need to optimize their builds eventually.
>> >>
>> >> Those are the rules:
>> >>
>> >> * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices.
>> >> * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20.
>> >> * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
>> >> * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours).
>> >> * Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations.
>> >>
>> >> Those numbers on their own do not tell much, but we can easily see what they mean when we put them side-by-side with "our" current numbers.
>> >>
>> >> * Currently - with all the "public" usage - we are at 8 full-time runners. This is after some of the changes I've done. With the recent changes I already moved a lot of the non-essential build components that do not require a lot of parallelism to public runners.
>> >> * The 20/15 jobs limit is a bit artificial (not really enforceable on workflow level) - but in our case, as I optimized most PRs to run just a subset of the tests, the average will be way below that - no matter if you are a committer or not, regular PRs are a far smaller subset of the jobs than the full "canary" build. And for canary builds we should stay - at least for now - with self-hosted runners.
>> >>
>> >> Some back-of-the-envelope calculations of what might happen when we switch to "public" for everyone:
>> >>
>> >> Unfortunately, until we enable the experiment, I do not have an easy way to distinguish the "canary" from "committer" runs, so those are a bit of a guess. But our self-hosted build time vs. public build time is ~20% more for self-hosted (100,000 minutes vs. 80,000 minutes this month) - see the attached screenshot for the current month.
>> >> As you can see - building images has already been moved to public runners for everyone as of two weeks or so, so that will not change.
>> >>
>> >> Taking into account that self-hosted ones are ~1.7x faster, this means that currently we have ~2x more self-hosted time used than public. We can assume that 50% of that is committer PRs and "canary" builds are the second half (sounds safe because canary builds use way more resources, even if committers run many more PRs than merges). So by moving committer builds to public runners, we will - likely - increase our public time 2x (from 8 FT runners to 16 FT runners) - way below the 25 FT runners that is the "cap" from INFRA. Even if we move all canary builds there, we should be at most at ~24 FTs, which is still below the limits, but would be dangerously close to it. That's why I want to keep canary builds as self-hosted until we can get some clarity on the impact of moving the PRs.
>> >>
>> >> We will see the final numbers when we move, but I think we are pretty safe within the limits.
>> >>
>> >> J.
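The back-of-the-envelope estimate in the quoted message above can be reproduced with a few lines of arithmetic. This is a rough sketch under the stated assumptions only: the input figures are the ones quoted in the thread, the 50/50 split between committer PRs and canary builds is the guess made there, and the function name `projected_ft` is made up for illustration.

```python
# Figures quoted in the thread; all of them are approximate.
SELF_HOSTED_MINUTES = 100_000  # self-hosted build minutes this month
PUBLIC_MINUTES = 80_000        # public build minutes this month
SPEEDUP = 1.7                  # self-hosted runners are ~1.7x faster
COMMITTER_SHARE = 0.5          # assumed share of self-hosted time used by committer PRs
CURRENT_PUBLIC_FT = 8          # current public usage, in full-time (FT) runners
WEEKLY_CAP_FT = 25             # INFRA weekly cap, in FT runners

MINUTES_PER_WEEK = 7 * 24 * 60  # minutes one FT runner provides per week


def projected_ft(moved_share: float) -> float:
    """FT public runners needed if `moved_share` of self-hosted work moves over."""
    # The same work takes ~1.7x longer on the slower public runners.
    public_equivalent = SELF_HOSTED_MINUTES * SPEEDUP
    growth = (PUBLIC_MINUTES + moved_share * public_equivalent) / PUBLIC_MINUTES
    return CURRENT_PUBLIC_FT * growth


committers_only = projected_ft(COMMITTER_SHARE)  # committer PRs only
everything = projected_ft(1.0)                   # committer PRs and canary builds

print(f"Weekly cap: {WEEKLY_CAP_FT * MINUTES_PER_WEEK} minutes")
print(f"Committer PRs on public runners: ~{committers_only:.1f} FT")
print(f"All builds on public runners:    ~{everything:.1f} FT")
```

With unrounded inputs this gives roughly 16.5 FT runners for committer PRs and about 25 FT runners if canary builds moved too, which matches the message's rounded figures closely and supports its conclusion that moving canary builds would sit right at the cap.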
>> >>
>> >> On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:
>> >>
>> >> Although 900 runners seem like a lot, they are shared among the Apache organization's 2.2k repositories. Of course only a few of them are active (let's say 50), and some of them use an external CI tool for big jobs (e.g. Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other very active repositories based entirely on GHA, for example Iceberg, Spark, Superset, ...
>> >>
>> >> I haven't found the ASF runners metrics dashboard to check the max concurrency and the max queued time during peak hours, but I'm sure that moving Airflow committers' CI jobs to public runners will put some pressure on these runners, especially since these committers are the most active contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64 GB RAM) are used almost all the time, so we can say that we will need around 70 ASF runners to run the same jobs.
>> >>
>> >> There is no harm in testing and deciding after 2-3 weeks.
>> >>
>> >> We also need to find a way to let the infra team help us solve the connectivity problem with the ARC runners <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>> >>
>> >> +1 for testing what you propose.
>> >>
>> >> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>> >>
>> >>> +1 I like the idea.
>> >>> Looking forward to seeing the difference.
>> >>>
>> >>> Thanks & Regards,
>> >>> Amogh Desai
>> >>>
>> >>> On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis <ferru...@amazon.com.invalid> wrote:
>> >>>
>> >>>> Interested in seeing the difference, +1
>> >>>>
>> >>>> - ferruzzi
>> >>>>
>> >>>> ________________________________
>> >>>> From: Oliveira, Niko <oniko...@amazon.com.INVALID>
>> >>>> Sent: Thursday, April 4, 2024 2:00 PM
>> >>>> To: dev@airflow.apache.org
>> >>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for commiter PRs
>> >>>>
>> >>>> +1 I'd love to see this as well.
>> >>>>
>> >>>> In the past, stability and long queue times of PR builds have been very frustrating. I'm not 100% sure this is due to using self-hosted runners, since 35 queue depth (to my mind) should be plenty. But something about that setup has never seemed quite right to me with queuing. Switching to public runners for a while to experiment would be great to see if it improves.
>> >>>>
>> >>>> ________________________________
>> >>>> From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
>> >>>> Sent: Thursday, April 4, 2024 12:41:02 PM
>> >>>> To: dev@airflow.apache.org
>> >>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling self-hosted runners for commiter PRs
>> >>>>
>> >>>> +1 from me to this idea.
>> >>>>
>> >>>> Sounds very reasonable to me.
>> >>>> At times, my experience has been better with public runners instead of self-hosted runners :)
>> >>>>
>> >>>> And, as already mentioned in the discussion, I think having the ability to apply the "use-self-hosted-runners" label at critical times would be nice to have too.
>> >>>>
>> >>>> On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
>> >>>>
>> >>>>> Hello everyone,
>> >>>>>
>> >>>>> TL;DR With some recent changes in GitHub Actions and the fact that ASF has a lot of runners available donated for all the builds, I think we could experiment with disabling "self-hosted" runners for committer builds.
>> >>>>>
>> >>>>> Our self-hosted runners have been extremely helpful (and we should again thank Amazon and Astronomer for donating credits / money for those) - when the GitHub public runners were far less powerful - and we had fewer of those available for ASF projects. This saved us a LOT of trouble when there was contention between ASF projects.
>> >>>>>
>> >>>>> But as of recently both limitations have been largely removed:
>> >>>>>
>> >>>>> * ASF has 900 public runners donated by GitHub to all projects
>> >>>>> * Those public runners (as of January) now have 4 CPUs and 16 GB of memory for open-source projects - https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>> >>>>>
>> >>>>> While they are not as powerful as our self-hosted runners, the parallelism we utilise brings those builds into not-that-bad shape compared to self-hosted runners. Typical times for the complete set of tests are now ~20m for public runners and ~14m for self-hosted ones.
>> >>>>>
>> >>>>> But this is not the only factor - I think committers experience the "Job failed" for self-hosted runners generally much more often than non-committers (stability of our solution is not the best, also we are using cheaper spot instances). Plus - we limit the total number of self-hosted runners (35) - so if several committers submit a few PRs and we have a canary build running, the jobs will wait until runners are available.
>> >>>>>
>> >>>>> And of course it costs the credits/money of sponsors, which we could use for other things.
>> >>>>>
>> >>>>> I have - as of recently - access to GitHub Actions metrics - and while ASF is keeping an eye on usage and has started limiting the number of parallel jobs that workflows in projects can run, it looks like even if all committer runs are added to the public runners, we will still cause far lower usage than the limits and far lower than some other projects (which I will not name here). I have access to the metrics so I can monitor our usage and react.
>> >>>>>
>> >>>>> I think possibly - if we switch committers to "public" runners by default - the experience will not be much worse for them (and sometimes even better - because of stability/limited queue).
>> >>>>>
>> >>>>> I was planning this carefully - I made a number of refactors/changes to our workflows recently that make it way easier to manipulate the configuration and get various conditions applied to various jobs - so changing/experimenting with those settings should be - well - a breeze :). A few recent changes have proven that this change and workflow refactor were definitely worth the effort. I feel like I finally got control over it, where previously it was a bit like herding a pack of cats (which I brought to life by myself, but that's another story).
>> >>>>>
>> >>>>> I would like to propose to run an experiment and see how it works if we switch committer PRs back to the public runners - leaving the self-hosted runners only for canary builds (which makes perfect sense because those builds run a full set of tests and we need as much speed and power there as we can get).
>> >>>>>
>> >>>>> This is pretty safe. We should be able to switch back very easily if we see problems. I will also monitor it and see if our usage is within the limits of the ASF. I can also add the feature that committers should be able to use self-hosted runners by applying the "use self-hosted runners" label to a PR.
>> >>>>>
>> >>>>> Running it for 2-3 weeks should be enough to gather experience from committers - whether things will seem better or worse for them - or maybe they won't really notice a big difference.
>> >>>>>
>> >>>>> Later we could consider some next steps - disabling the self-hosted runners for canary builds if we see that our usage is low and builds are fast enough, eventually possibly removing current self-hosted runners and switching to a better k8s-based infrastructure (which we are close to doing, but it is a bit difficult while the current self-hosted solution is so critical to keep running - like rebuilding the plane while it is flying). I'd love to do it gradually in the "change slowly and observe" mode - especially now that I have access to "proper" metrics.
>> >>>>>
>> >>>>> WDYT?
>> >>>>>
>> >>>>> J.