Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

Wei Lee Fri, 05 Apr 2024 09:09:46 -0700

+1 for this. I have not yet had much chance to experience many job failures, 
but it won’t harm us to test this out. Plus, it saves some of the cost.


Best,
Wei

> On Apr 5, 2024, at 11:36 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> Seeing no big "no's" - I will prepare and run the experiment - starting
> some time next week, after we get 2.9.0 out - I do not want to break
> anything there. In the meantime, a preparatory PR to add the "use self-hosted
> runners" label is out: https://github.com/apache/airflow/pull/38779
> 
> On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar
> <rbish...@amazon.com.invalid> wrote:
> 
>> +1 to trying this out. I agree with keeping the canary builds
>> self-hosted in order to validate the usage for the PRs.
>> 
>> -- Rajesh
>> 
>> 
>> From: Jarek Potiuk <ja...@potiuk.com>
>> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> Date: Friday, April 5, 2024 at 8:36 AM
>> To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>> self-hosted runners for commiter PRs
>> 
>> 
>> CAUTION: This email originated from outside of the organization. Do not
>> click links or open attachments unless you can confirm the sender and know
>> the content is safe.
>> 
>> Yeah. Valid concerns Hussein.
>> 
>> And I am happy to share some more information on that. I did not want to
>> put all of that in the original email, but I see that might be interesting
>> for you and possibly others.
>> 
>> I am closely following the numbers now. One of the reasons I am doing /
>> proposing it now is that (after almost 3 years of waiting) we
>> finally have access to some metrics that we can check. As of last week I
>> got access to the ASF metrics (
>> https://issues.apache.org/jira/browse/INFRA-25662).
>> 
>> I have access to "organisation" level information. Infra does not want to
>> open it to everyone - even to every member - but since I have been very
>> active and have been helping with a number of things, I was granted access
>> as an exception. Also, I saw a small dashboard that INFRA is preparing to
>> open to everyone once they sort out the access, where we will be able to see
>> the "per-project" usage.
>> 
>> Some stats that I can share (they asked not to share too much).
>> 
>> From what I looked at I can tell that we are right now (the whole ASF
>> organisation) safely below the total capacity. With a large margin - enough
>> to handle spikes, but of course the growth of usage is there and if
>> uncontrolled - we can again reach the same situation that triggered getting
>> self-hosted runners a few years ago.
>> 
>> Luckily - INFRA is getting it under control this time (and metrics will help).
>> In the last INFRA newsletter, they announced some limitations that will
>> apply to the projects (effective as of end of April) - so once those are
>> followed, we should be "safe" from being impacted by others (i.e. the
>> noisy-neighbour effect). Some of the projects (not Airflow (!)) were
>> exceeding those so far and they will be capped - they will need to optimize
>> their builds eventually.
>> 
>> Those are the rules:
>> 
>> * All workflows MUST have a job concurrency level less than or equal to
>> 20. This means a workflow cannot have more than 20 jobs running at the same
>> time across all matrices.
>> * All workflows SHOULD have a job concurrency level less than or equal to
>> 15. Just because 20 is the max, doesn't mean you should strive for 20.
>> * The average number of minutes a project uses per calendar week MUST NOT
>> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200
>> hours).
>> * The average number of minutes a project uses in any consecutive five-day
>> period MUST NOT exceed the equivalent of 30 full-time runners (216,000
>> minutes, or 3,600 hours).
>> * Projects whose builds consistently cross the maximum use limits will
>> lose their access to GitHub Actions until they fix their build
>> configurations.
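To make the limits above concrete, here is a small sketch of the arithmetic behind the "full-time runner" figures. The function name and the definition of a full-time runner as one runner busy 24 hours a day are assumptions made for illustration; they are not taken from the INFRA policy text.

```python
MINUTES_PER_DAY = 24 * 60


def runner_budget_minutes(full_time_runners: int, days: int) -> int:
    """Minutes used by `full_time_runners` runners busy 24h/day for `days` days.

    Assumption: "full-time runner" means one runner that is busy around the clock.
    """
    return full_time_runners * days * MINUTES_PER_DAY


# Weekly cap: 25 full-time runners over 7 days.
weekly_cap = runner_budget_minutes(25, 7)
# Five-day cap: 30 full-time runners over 5 days.
five_day_cap = runner_budget_minutes(30, 5)
# Airflow's current public usage quoted in this thread: 8 full-time runners.
airflow_weekly = runner_budget_minutes(8, 7)

if __name__ == "__main__":
    print(f"weekly cap:     {weekly_cap} min = {weekly_cap // 60} h")
    print(f"five-day cap:   {five_day_cap} min = {five_day_cap // 60} h")
    print(f"airflow weekly: {airflow_weekly} min "
          f"({airflow_weekly / weekly_cap:.0%} of the weekly cap)")
```

Under that assumption the weekly cap works out to 252,000 minutes (exactly 4,200 hours), so the 250,000 figure quoted above looks like a rounded value, while the five-day figure of 216,000 minutes (3,600 hours) matches exactly.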
>> 
>> Those numbers on their own do not tell much, but we can easily see what
>> they mean when we put them side-by-side with "our" current numbers.
>> 
>> * Currently - with all the "public" usage we are at 8 full-time runners.
>> This is after some of the changes I've done. With the recent changes I
>> already moved a lot of the non-essential build components that do not
>> require a lot of parallelism to public runners.
>> * The 20/15 jobs limit is a bit artificial (not really enforceable on
>> workflow level) - but in our case, as I optimized most PRs to run just a
>> subset of the tests, the average will be way below that - no matter if you
>> are a committer or not, regular PRs run a far smaller subset of the jobs than
>> a full "canary" build. And for canary builds we should stay - at least for
>> now - with self-hosted runners.
>> 
>> Some back-of-the-envelope calculations of what might happen when we
>> switch to "public" for everyone:
>> 
>> Unfortunately, until we enable the experiment, I do not have an easy way
>> to distinguish the "canary" from "committer" runs so those are a bit of a
>> guess. But our self-hosted build time is ~ 25% more than our public build
>> time (100,000 minutes vs. 80,000 minutes this month) - see the
>> attached screenshot for the current month.
>> As you can see - image builds have already been moved to public runners for
>> everyone as of two weeks or so, so that will not change.
>> 
>> Taking into account that self-hosted ones are ~ 1.7x faster, this means
>> that currently we have ~ 2x more self-hosted time used than public (in
>> public-runner-equivalent minutes). We can
>> assume that 50% of that are committer PRs and "Canary" builds are the
>> second half (sounds safe because canary builds use way more resources, even
>> if committers run many more PRs than merges).
>> So by moving committer builds to public runners, we will - likely -
>> increase our public time 2x (from 8 FT runners to 16 FT runners) - way
>> below the 25 FT runners that is the "cap" from INFRA. Even if we move all
>> Canary builds there, we should be at most at ~24 FTs, which is still below
>> the limits, but would be dangerously close to them. That's why I want to keep
>> canary builds as self-hosted until we can get some clarity on the impact of
>> moving the "PR" builds.
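The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in this email (8 full-time public runners, 100,000 vs. 80,000 minutes, a ~1.7x speed difference, and a 50/50 split between committer PRs and canary builds); the function and key names are hypothetical.

```python
def estimate_public_runners(
    current_public_ft: float,
    self_hosted_minutes: float,
    public_minutes: float,
    speedup: float,
    committer_share: float,
) -> dict[str, float]:
    """Estimate full-time (FT) public runners needed if self-hosted work moves over."""
    # Self-hosted work relative to current public usage, scaled up because
    # the same job takes `speedup` times longer on a public runner.
    equivalent_ratio = (self_hosted_minutes / public_minutes) * speedup
    moved_total = current_public_ft * equivalent_ratio
    committer_extra = moved_total * committer_share
    return {
        "equivalent_ratio": equivalent_ratio,
        "after_committers": current_public_ft + committer_extra,
        "after_canary": current_public_ft + moved_total,
    }


# Figures quoted in this email; the 50% committer share is the email's own guess.
estimate = estimate_public_runners(
    current_public_ft=8,
    self_hosted_minutes=100_000,
    public_minutes=80_000,
    speedup=1.7,
    committer_share=0.5,
)

if __name__ == "__main__":
    for name, value in estimate.items():
        print(f"{name}: {value:.2f}")
```

With the unrounded ratio (about 2.1 rather than 2) the sketch gives roughly 16.5 and 25 full-time runners instead of the rounded 16 and 24, which does not change the conclusion: committer PRs alone fit well below the cap, while also moving canary builds would land at or very near it.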
>> 
>> We will see the final numbers when we move, but I think we are pretty safe
>> within the limits.
>> 
>> J.
>> 
>> 
>> On Fri, Apr 5, 2024 at 1:16 PM Hussein Awala <huss...@awala.fr> wrote:
>> Although 900 runners seem like a lot, they are shared among the Apache
>> organization's 2.2k repositories. Of course, only a few of them are active
>> (let's say 50), and some of them use an external CI tool for big jobs (e.g.
>> Kafka uses Jenkins, Hudi uses Azure Pipelines), but we have other very
>> active repositories based entirely on GHA, for example Iceberg, Spark,
>> Superset, ...
>> 
>> I haven't found the ASF runners metrics dashboard to check the max
>> concurrency and the max queued time during peak hours, but I'm sure that
>> moving Airflow committers' CI jobs to public runners will put some pressure
>> on these runners, especially since these committers are the most active
>> contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64
>> GB RAM) are used almost all the time, so we can say that we will need
>> around 70 ASF runners to run the same jobs.
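One plausible reading of the "around 70" figure is a simple CPU-parity conversion: 35 self-hosted runners with 8 CPUs each, replaced by public runners with the 4 CPUs mentioned elsewhere in this thread. The sketch below assumes exactly that and nothing more; it ignores memory (64 GB vs. 16 GB), disk, and network differences, and the function name is hypothetical.

```python
import math


def equivalent_public_runners(
    self_hosted_count: int, self_hosted_cpus: int, public_cpus: int
) -> int:
    """Public runners needed to match the total CPU count of the self-hosted pool.

    Assumption: CPU count is the only thing that matters.
    """
    total_cpus = self_hosted_count * self_hosted_cpus
    return math.ceil(total_cpus / public_cpus)


# 35 self-hosted runners x 8 CPUs, replaced by 4-CPU public runners.
needed = equivalent_public_runners(35, 8, 4)

if __name__ == "__main__":
    print(f"public runners needed for CPU parity: {needed}")
```

This is an upper-bound style estimate: it assumes the self-hosted pool is busy all the time, as the paragraph above suggests.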
>> 
>> There is no harm in testing and deciding after 2-3 weeks.
>> 
>> We also need to find a way to let the infra team help us solve the
>> connectivity problem with the ARC runners
>> <https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.
>> 
>> +1 for testing what you propose.
>> 
>> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>> 
>>> +1 I like the idea.
>>> Looking forward to seeing the difference.
>>> 
>>> Thanks & Regards,
>>> Amogh Desai
>>> 
>>> 
>>> On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis
>>> <ferru...@amazon.com.invalid>
>>> wrote:
>>> 
>>>> Interested in seeing the difference, +1
>>>> 
>>>> 
>>>> - ferruzzi
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Oliveira, Niko <oniko...@amazon.com.INVALID>
>>>> Sent: Thursday, April 4, 2024 2:00 PM
>>>> To: dev@airflow.apache.org
>>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>>>> self-hosted runners for commiter PRs
>>>> 
>>>> +1 I'd love to see this as well.
>>>> 
>>>> In the past, stability and long queue times of PR builds have been very
>>>> frustrating. I'm not 100% sure this is due to using self-hosted runners,
>>>> since a queue depth of 35 (to my mind) should be plenty. But something about
>>>> that setup has never seemed quite right to me with queuing. Switching to
>>>> public runners for a while to experiment would be great to see if it
>>>> improves.
>>>> 
>>>> ________________________________
>>>> From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
>>>> Sent: Thursday, April 4, 2024 12:41:02 PM
>>>> To: dev@airflow.apache.org
>>>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>>>> self-hosted runners for commiter PRs
>>>> 
>>>> +1 from me to this idea.
>>>> 
>>>> Sounds very reasonable to me.
>>>> At times, my experience has been better with public runners instead of
>>>> self-hosted runners :)
>>>> 
>>>> And like already mentioned in the discussion, I think having the ability
>>>> to apply the label "use-self-hosted-runners" for critical times would be
>>>> nice to have too.
>>>> 
>>>> 
>>>> On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:
>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> TL;DR With some recent changes in GitHub Actions and the fact that ASF has
>>>>> a lot of runners available, donated for all the builds, I think we could
>>>>> experiment with disabling "self-hosted" runners for committer builds.
>>>>> 
>>>>> Our self-hosted runners have been extremely helpful (and we should
>>>>> again thank Amazon and Astronomer for donating credits / money for those)
>>>>> back when the GitHub public runners were far less powerful and we had
>>>>> fewer of them available for ASF projects. This saved us a LOT of trouble
>>>>> when there was contention between ASF projects.
>>>>> 
>>>>> But as of recently both limitations have been largely removed:
>>>>> 
>>>>> * ASF has 900 public runners donated by GitHub to all projects
>>>>> * Those public runners (as of January) now have 4 CPUs and 16GB of memory
>>>>> for open-source projects -
>>>>> https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>>>>> 
>>>>> While they are not as powerful as our self-hosted runners, the parallelism
>>>>> we utilise brings those builds into not-that-bad shape compared to
>>>>> self-hosted runners. Typical times for the complete set of tests are now
>>>>> ~ 20m for public runners and ~ 14m for self-hosted ones.
>>>>> 
>>>>> But this is not the only factor - I think committers experience "Job
>>>>> failed" on self-hosted runners generally much more often than
>>>>> non-committers (the stability of our solution is not the best, and we are
>>>>> also using cheaper spot instances). Plus - we limit the total number of
>>>>> self-hosted runners (35) - so if several committers submit a few PRs and we
>>>>> have a canary build running, the jobs will wait until runners are available.
>>>>> 
>>>>> And of course it costs the credits/money of sponsors which we could use for
>>>>> other things.
>>>>> 
>>>>> I have - as of recently - access to GitHub Actions metrics - and while ASF
>>>>> is keeping an eye on it and has started limiting the number of parallel jobs
>>>>> that workflows in projects can run, it looks like even if all committer runs
>>>>> are added to the public runners, we will still cause far lower usage than
>>>>> the limits allow and far lower than some other projects (which I will not
>>>>> name here). I have access to the metrics so I can monitor our usage and
>>>>> react.
>>>>> 
>>>>> I think possibly - if we switch committers to "public" runners by default -
>>>>> the experience will not be much worse for them (and sometimes even better -
>>>>> because of stability/limited queue).
>>>>> 
>>>>> I was planning this carefully - I made a number of refactors/changes to our
>>>>> workflows recently that make it way easier to manipulate the configuration
>>>>> and get various conditions applied to various jobs - so
>>>>> changing/experimenting with those settings should be - well - a breeze :).
>>>>> A few recent changes have proven that this change and workflow refactor were
>>>>> definitely worth the effort. I feel like I finally got control over it,
>>>>> where previously it was a bit like herding a pack of cats (which I
>>>>> brought to life myself, but that's another story).
>>>>> 
>>>>> I would like to propose to run an experiment and see how it works if we
>>>>> switch committer PRs back to the public runners - leaving the self-hosted
>>>>> runners only for canary builds (which makes perfect sense because those
>>>>> builds run a full set of tests and we need as much speed and power there as
>>>>> we can get).
>>>>> 
>>>>> This is pretty safe. We should be able to switch back very easily if we see
>>>>> problems. I will also monitor it and see if our usage is within the limits
>>>>> of the ASF. I can also add the feature that committers should be able to
>>>>> use self-hosted runners by applying the "use self-hosted runners" label to
>>>>> a PR.
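The proposed routing rule can be summarised in a few lines. This is a minimal sketch of the decision described above, not the actual Airflow workflow code: the function, its parameters, and the returned strings are hypothetical, and the assumption that the label only has an effect on committer PRs is an interpretation of the proposal.

```python
from collections.abc import Iterable

# Label name as quoted in the proposal above.
SELF_HOSTED_LABEL = "use self-hosted runners"


def select_runner(is_canary: bool, is_committer: bool, pr_labels: Iterable[str]) -> str:
    """Return "self-hosted" or "public" for a given build (hypothetical helper)."""
    # Canary builds run the full test set, so they stay on self-hosted runners.
    if is_canary:
        return "self-hosted"
    # Committers can opt back in to self-hosted runners with the label.
    if is_committer and SELF_HOSTED_LABEL in set(pr_labels):
        return "self-hosted"
    # Everything else, including committer PRs by default, uses public runners.
    return "public"


if __name__ == "__main__":
    print(select_runner(is_canary=False, is_committer=True, pr_labels=[]))
    print(select_runner(is_canary=False, is_committer=True,
                        pr_labels=[SELF_HOSTED_LABEL]))
```

Keeping the rule in one place like this is what would make switching back easy if the experiment shows problems.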
>>>>> 
>>>>> Running it for 2-3 weeks should be enough to gather experience from
>>>>> committers - whether things will seem better or worse for them - or maybe
>>>>> they won't really notice a big difference.
>>>>> 
>>>>> Later we could consider some next steps - disabling the self-hosted runners
>>>>> for canary builds if we see that our usage is low and builds are fast
>>>>> enough, eventually possibly removing the current self-hosted runners and
>>>>> switching to a better k8s-based infrastructure (which we are close to doing,
>>>>> but it is a bit difficult while the current self-hosted solution is so
>>>>> critical to keep running - like rebuilding the plane while it is flying).
>>>>> I'd love to do it gradually in the "change slowly and observe" mode -
>>>>> especially now that I have access to "proper" metrics.
>>>>> 
>>>>> WDYT?
>>>>> 
>>>>> J.
>>>>> 
>>>> 
>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
