potiuk commented on PR #30672: URL: https://github.com/apache/airflow/pull/30672#issuecomment-1510508451
Ok. Not sure if you know what you've asked for, but well, you did @hussein-awala :).

> I have a concern regarding the use of a lot of parallelism in our workflows. While it can greatly enhance our development speed, there is a possibility that a small number of PRs could occupy all the runners, potentially delaying the progress of other PRs. It would be beneficial to have some kind of configuration in place to prevent this from happening, such as limiting the number of runners a single PR can use. This would help ensure that all PRs have access to the necessary resources and can be processed efficiently.

Looking at the available metrics, I don't think we need to do anything about it. I regularly check the metrics we have (see below) and every time I see an optimisation opportunity, I implement it. However, I would love someone else to take a deep dive, understand our system and improve it. So if you would like to join me - I am all ears and happy to share everything about it. I think it would be best if you first try to understand how our system works in detail, and then we can discuss ways things can be improved. If you have some ideas (after getting a grasp of it) on how to improve it and what it might take, I would love to see them.

The "noisy-neighbour" issue has been an ongoing debate for the Apache Software Foundation and Airflow for years. It used to be very bad at times, but with a lot of effort on our side, with helping other projects optimise their jobs, and especially with GitHub providing more public runners and us having sponsored self-hosted runners, the issue is currently non-existent IMHO.

Just for context: all non-committers use public runners (except for image building). All our public runners run in a shared pool of 900 jobs that the ASF shares between all of its projects (there are ~700 projects in the ASF using GitHub Actions), and for Airflow we have self-hosted runners in an auto-scaling group of up to 35 runners (8 CPU, 64 GB) that can be scaled on demand - we run them all on AWS-provided credits, sponsored by Astronomer if needed when the credits run out. There is a rough sketch of that committer/non-committer routing below.

Here is one of the charts showing how many ASF projects use GitHub Actions:

<img width="1253" alt="Screenshot 2023-04-17 at 00 42 37" src="https://user-images.githubusercontent.com/595491/232347014-8454e385-5704-42d4-9e90-b893d33bfa05.png">

For the ASF, metrics and solving the "noisy-neighbour" problem have been on the table for about 3 years now and I doubt we can solve it quickly - though if you have some ideas, I am all ears. We've been discussing similar subjects at length (I will post some resources), and if you can add something to the discussion (after reading that and getting the context), it would be great. Adding some metrics that would give us more value would also be great.

Also, we have a big change ahead of us: converting our ci-infra (developed initially by @ashb based on some AWS components - DynamoDB, Auto Scaling groups and a few others) to Kubernetes. Since the time we had to implement our own custom solution, a few ways (some even supported by GitHub) to deploy it on Kubernetes have appeared, and we also have a promise from Google to get free credits on their cloud if we manage to do it as a standard Kubernetes deployment, which would increase our CI capacity greatly. Our auto-scaling infrastructure is in https://github.com/apache/airflow-ci-infra (I am happy to guide your way - it's not super-documented, but it is pretty up-to-date and the code describes what we do pretty well).
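To illustrate the committer vs. public-runner routing I mentioned above, here is a minimal, purely illustrative sketch - the function, the labels and the hard-coded committer set are made up for the example; the real decision is made inside the workflow files and is more involved:

```python
# A minimal, hypothetical sketch of the runner routing described above:
# committers' jobs go to the sponsored self-hosted pool, everybody else's
# jobs go to the shared ASF pool of public GitHub runners, and image
# building always happens on the self-hosted runners. Names, labels and the
# hard-coded committer set are illustrative only - the real logic lives in
# the workflow files.
from __future__ import annotations

COMMITTERS = {"potiuk", "ashb"}  # illustrative placeholder, not the real list


def runs_on(actor: str, job: str) -> str:
    """Return the GitHub Actions `runs-on` label to use for a given actor and job."""
    if job == "build-images":
        # Image building is the exception mentioned above - it always uses
        # the self-hosted runners.
        return "self-hosted"
    if actor in COMMITTERS:
        # Committers' PRs can use the auto-scaling self-hosted pool
        # (up to 35 runners, 8 CPU / 64 GB each).
        return "self-hosted"
    # Everyone else shares the ASF-wide pool of public GitHub runners.
    return "ubuntu-22.04"


if __name__ == "__main__":
    print(runs_on("potiuk", "tests"))                  # -> self-hosted
    print(runs_on("first-time-contributor", "tests"))  # -> ubuntu-22.04
```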
Note - I am not going to discourage you, quite the opposite - I'd love to get a helping hand from someone who would start understanding and contributing to our system. So I will add some pointers so that you can take a look.

> I am not sure if we have any metrics or analytics available to measure the average duration in queue and the overall duration of a workflow. It would be helpful to have access to this information to evaluate the impact of any changes we make to our CI workflow. By tracking the average queue duration and the total duration of a workflow, we can better understand how changes affect our development process and identify areas for improvement.

I am so glad someone else would like to look at that :). I would love to get some PRs improving the test harness we have on CI. Gathering and analysing some metrics would also be good, but I am afraid the metrics for GitHub Actions are very poor - if you could improve them, that would be great.

You can read some history of that. There are some metrics that a friend of mine and contributor @TobKed developed for the whole Apache Software Foundation in the absence of proper metrics for GitHub Actions (we have been gathering these metrics for ~2 years with this "duct-tape" solution of ours, alongside continuous promises from GitHub to provide better metrics - they are promised, but not yet available). If you send me your Gmail address, I can share access to the report (ping me on Slack).

Here is an example "workflow queued average" for all ASF projects on public runners:

<img width="1354" alt="Screenshot 2023-04-17 at 00 19 15" src="https://user-images.githubusercontent.com/595491/232345787-61c43340-72ca-43a5-9fbb-20d04b4327a9.png">

This one shows that Airflow is not even on the radar of queued workflows, because of the heavy optimisations we've implemented. It used to be much worse, and we had a lot of trouble ~3 years ago, but then I implemented plenty of optimisations and checks and @ashb implemented our self-hosted infrastructure (and we got credits to run it). Since then the ASF also went from 50 to 900 jobs sponsored by GitHub, which (at least temporarily) solved the noisy-neighbour problems we had, but we still try to use the runners in as optimised a way as possible (see below).

<img width="1301" alt="Screenshot 2023-04-17 at 00 26 17" src="https://user-images.githubusercontent.com/595491/232346157-9ff7e8bc-00f0-44b4-b1cc-20659cefe214.png">

The docs describing the status of GitHub Actions in the ASF (I try to keep them updated, though recently there have not been many updates) are here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Regarding the optimisations we have - I think you should look at the "selective checks" part. We go to GREAT lengths to make sure that each PR only runs whatever it needs to run. Basically, PRs usually run only a very small selection of tests - only those that are very likely to be affected. For example, unit tests for the Helm chart are only run when the Helm chart is modified. There are many more rules - for example, provider tests are only run for a subset of providers: only the providers that are changed and those that have cross-dependencies with the changed providers. All of them are described in https://github.com/apache/airflow/blob/main/dev/breeze/SELECTIVE_CHECKS.md (which also reminds me that I should review and update it, as there are a few more selective check rules which are not documented there).
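To give you a feel for what those rules look like, here is a deliberately simplified sketch - the function and the exact rule shapes are illustrative only; the real implementation in breeze is much richer (see the documentation and tests linked in this comment):

```python
# A deliberately simplified, hypothetical sketch of the kind of rules our
# selective checks implement. The function and the exact rule shapes are
# made up for illustration - the real implementation in breeze is richer.
from __future__ import annotations


def selected_test_types(changed_files: list[str]) -> set[str]:
    """Map the files touched by a PR to the groups of tests that should run."""
    test_types: set[str] = set()

    # Rule: Helm chart unit tests only run when chart files were modified.
    if any(f.startswith("chart/") for f in changed_files):
        test_types.add("Helm")

    # Rule: provider tests only run for the providers that were changed
    # (the real rules also add providers that cross-depend on them).
    changed_providers = {
        f.split("/")[2]
        for f in changed_files
        if f.startswith("airflow/providers/") and len(f.split("/")) > 3
    }
    if changed_providers:
        test_types.add("Providers[" + ",".join(sorted(changed_providers)) + "]")

    # Rule: changes to Airflow core trigger the core test suite.
    if any(
        f.startswith("airflow/") and not f.startswith("airflow/providers/")
        for f in changed_files
    ):
        test_types.add("Core")

    return test_types


if __name__ == "__main__":
    print(selected_test_types(["chart/values.yaml"]))
    # -> {'Helm'}
    print(selected_test_types(["airflow/providers/amazon/aws/hooks/s3.py"]))
    # -> {'Providers[amazon]'}
```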
Also, the vast majority of those selective check rules (the ones with a bit more complex logic) are nicely covered by rather comprehensive unit tests in breeze: https://github.com/apache/airflow/blob/main/dev/breeze/tests/test_selective_checks.py

You can get even more context by reading the description of our CI system (I keep it up to date regularly): https://github.com/apache/airflow/blob/main/CI.rst - including the charts showing how our jobs work and explaining some of the built-in optimisations (for example, we only build images once and re-use them across all the builds, using the super-efficient remote caching that docker buildkit added some time ago - there is a rough sketch of that at the end of this comment): https://github.com/apache/airflow/blob/main/CI_DIAGRAMS.md

I know it is a lot of information, but hey, you asked :). And I would really love to have someone else look at all of this, not only me :). I am quite a bit of a SPOF for it, so if you would like to dig in and have more questions, I am happy to share everything (though it might take some time to grasp it, I am afraid).
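PS. To make the "build images once and re-use them with remote caching" part a bit more concrete, here is a rough, hypothetical sketch of how buildkit registry caching achieves it - the image and cache refs are placeholders, and in our CI this is all driven by breeze rather than a script like this, but the idea is the same:

```python
# A rough, hypothetical sketch of the "build once, re-use everywhere" idea
# with buildkit remote (registry) caching: one job builds the CI image and
# publishes both the image and its layer cache, every other job just pulls
# the already-built image. The image/cache refs below are placeholders,
# not the real ones used by Airflow's CI.
import subprocess

IMAGE = "ghcr.io/example/airflow-ci:latest"      # hypothetical CI image ref
CACHE = "ghcr.io/example/airflow-ci:buildcache"  # hypothetical remote cache ref


def build_and_push_with_cache() -> None:
    """Run once per set of builds: build the CI image and publish the remote cache."""
    subprocess.run(
        [
            "docker", "buildx", "build",
            "--tag", IMAGE,
            # Re-use whatever the previous build left in the registry cache...
            "--cache-from", f"type=registry,ref={CACHE}",
            # ...and write the full layer cache back for subsequent builds.
            "--cache-to", f"type=registry,ref={CACHE},mode=max",
            "--push",
            ".",
        ],
        check=True,
    )


def pull_prebuilt_image() -> None:
    """Run in every test job: just pull the image built by the build job."""
    subprocess.run(["docker", "pull", IMAGE], check=True)
```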
