potiuk commented on PR #30672:
URL: https://github.com/apache/airflow/pull/30672#issuecomment-1510508451

   Ok. Not sure if you know what you've asked for, but well, you did @hussein-awala :).
   
   > I have a concern regarding the use of a lot of parallelism in our 
workflows. While it can greatly enhance our development speed, there is a 
possibility that a few number of PRs could occupy all the runners, potentially 
delaying the progress of other PRs. It would be beneficial to have some kind of 
configurations in place to prevent this from happening, such as limiting the 
number of runners a single PR can use. This would help ensure that all PRs have 
access to the necessary resources and can be processed efficiently.
   
   Looking at the available metrics, I don't think we need to do anything about it. I regularly check the metrics we have (see below), and every time I see an optimisation opportunity I implement it. However, I would love someone else to take a deep dive, understand our system, and improve it. So if you would like to join me, I am all ears and happy to share everything about it. I think it would be best if you first try to understand how our system works in detail; then we can discuss ways things could be improved. If you have some ideas (after getting a grasp of it) on how to improve it and what it might take, I would love to see them.
   
   The "noisy-neighbour" issue has been on-going debate for the Apache Software 
Foundation and Airflow for years now and it used to be very bad at times, but 
with a lot of our effort, helping others to optimise their jobs and especially 
with GitHub providing more public runners and us having sponsored self-hosted 
runners, the issue is non-existing currently IMHO.
   
   Just for context: all non-committers use public runners (except for image building). All our public runners run in a shared pool of 900 jobs that the ASF shares between all of its projects (there are ~700 projects in the ASF using GitHub Actions). For Airflow we also have self-hosted runners in an auto-scaling group of up to 35 runners (8 CPU, 64 GB each) that can be scaled on demand - we run them on AWS-provided credits and, when the credits run out, on capacity sponsored by Astronomer.
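   
   To make that concrete, here is a minimal, purely illustrative sketch of the kind of runner-selection decision our workflows make - the committer list and runner labels below are made-up placeholders, not the actual build-info logic:
   
   ```python
   # Illustrative only - NOT the actual Airflow workflow logic.
   # The committer set and runner labels are placeholders.
   COMMITTERS = {"potiuk", "ashb"}  # hypothetical subset, for illustration


   def runs_on(pr_author: str, is_image_build: bool) -> list[str]:
       """Pick runner labels for a CI job.

       Non-committer PRs stay on the shared public pool (~900 concurrent jobs
       shared across the whole ASF), except for image building; committer PRs
       may use the auto-scaling self-hosted pool (up to ~35 runners,
       8 CPU / 64 GB each).
       """
       if pr_author in COMMITTERS or is_image_build:
           return ["self-hosted", "linux", "x64"]
       return ["ubuntu-22.04"]


   print(runs_on("random-contributor", is_image_build=False))  # public runner
   print(runs_on("potiuk", is_image_build=True))                # self-hosted
   ```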
   
   Here is one of the charts that shows how many ASF projects are using GitHub Actions:
   
   <img width="1253" alt="Screenshot 2023-04-17 at 00 42 37" 
src="https://user-images.githubusercontent.com/595491/232347014-8454e385-5704-42d4-9e90-b893d33bfa05.png";>
   
   ASF-wide metrics and solving the "noisy-neighbour" problem have been discussed for about 3 years now, and I doubt we can solve it quickly, though if you have some ideas, I am all ears. We've discussed similar subjects at length (I will post some resources), and if you can add something to the discussion after reading those and getting the context, that would be great. Adding some metrics that could give us more value would also be great.
   
   Also, we have a big change to do: converting our ci-infra (developed initially by @ashb on top of AWS components - DynamoDB, Auto Scaling groups and a few others) to Kubernetes. Since the time we had to implement our own custom solution, a few ways (some even supported by GitHub) to deploy self-hosted runners on Kubernetes have appeared, and we also have a promise from Google to get free credits on their cloud if we manage to do it as a standard Kubernetes deployment, which would increase our CI capacity greatly.
   
   Our infrastructure for the auto-scaling is in https://github.com/apache/airflow-ci-infra (I am happy to guide your way - it's not super-documented, but it is pretty up-to-date and the code describes what we do pretty well).
   
   Note - I am not going to discourage you, quite the opposite - I'd love to get a helping hand from someone who starts understanding and contributing to our system. So I will add some pointers so that you can take a look.
   
   > I am not sure if we have any metrics or analytics available to measure the 
average duration in queue and the overall duration of a workflow. It would be 
helpful to have access to this information to evaluate the impact of any 
changes we make to our CI workflow. By tracking the average queue duration and 
the total duration of a workflow, we can better understand how changes affect 
our development process and identify areas for improvement.
   
   I am so glad someone else would like to look at that :). I would love to get some PRs improving the test harness we have on CI. Getting some metrics and analysing them would also be good, but I am afraid the metrics for GitHub Actions are very poor - if you could improve them, that would be great. You can read some history of that: there are metrics that a friend of mine and contributor @TobKed developed in the absence of built-in metrics for GitHub Actions, covering the whole Apache Software Foundation. We have been gathering these metrics for ~2 years with our "duct-tape" solution, alongside continuous promises of better metrics from GitHub, which are not yet available (though promised). If you send me your gmail address, I can share access to the report (ping me on Slack). Here is an example "workflow queued average" for all ASF projects on public runners:
   
   <img width="1354" alt="Screenshot 2023-04-17 at 00 19 15" 
src="https://user-images.githubusercontent.com/595491/232345787-61c43340-72ca-43a5-9fbb-20d04b4327a9.png";>
   
   This one shows that Airflow is not even on the radar of queued workflows, because of the heavy optimisations we've implemented. It used to be much worse - we had a lot of trouble ~3 years ago - but then I implemented plenty of optimisations and checks and @ashb implemented our self-hosted infrastructure (and we got credits to run it). Since then the ASF has also gone from 50 to 900 GitHub-sponsored jobs, which (at least temporarily) solved the noisy-neighbour problems we had, and we keep trying to use the runners in as optimized a way as possible (see below).
   
   <img width="1301" alt="Screenshot 2023-04-17 at 00 26 17" 
src="https://user-images.githubusercontent.com/595491/232346157-9ff7e8bc-00f0-44b4-b1cc-20659cefe214.png";>
   
   The docs describing the status of GitHub Actions in the ASF (I try to keep them updated, though recently there have not been many updates) are here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
   
   Regarding the optimizations we have - I think you should look at the "selective checks" part. We go to GREAT lengths to make sure that each PR only runs whatever it needs to run. Basically, PRs usually run only a very small selection of tests - only those that are very likely to be affected. For example, unit tests for the Helm chart are only run when the Helm chart is modified. There are many more rules (for example, provider tests are only run for a subset of providers - only those that are changed and those that have cross-dependencies with the changed providers). All of the rules are described in https://github.com/apache/airflow/blob/main/dev/breeze/SELECTIVE_CHECKS.md (which also reminds me that I should review and update it, as there are a few more selective-check rules which are not documented there). Also, the vast majority of those selective checks (the ones with a bit more complex logic) are nicely covered by the rather comprehensive selective-checks unit tests in breeze: https://github.com/apache/airflow/blob/main/dev/breeze/tests/test_selective_checks.py
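   
   Just to illustrate the general idea (the real implementation in dev/breeze is far richer and covers many more rules), a toy version of selective checks could look roughly like this - the path prefixes and test-group names below are simplified assumptions, not the actual Breeze output:
   
   ```python
   # Toy illustration of the selective-checks idea - NOT the real Breeze logic.
   # Path prefixes and test-group names are simplified assumptions.
   from __future__ import annotations


   def tests_to_run(changed_files: list[str]) -> set[str]:
       """Map the files changed in a PR to the test groups that need to run."""
       selected: set[str] = set()
       for path in changed_files:
           if path.startswith("chart/"):
               selected.add("helm-tests")
           elif path.startswith("airflow/providers/"):
               # The real rules also pull in providers that have
               # cross-dependencies with the changed ones.
               provider = path.split("/")[2]
               selected.add(f"provider-tests[{provider}]")
           elif path.startswith(("airflow/", "tests/")):
               selected.add("core-tests")
           elif path.endswith((".rst", ".md")):
               selected.add("docs-build")
       return selected


   print(tests_to_run(["chart/values.yaml"]))                         # {'helm-tests'}
   print(tests_to_run(["airflow/providers/amazon/aws/hooks/s3.py"]))  # provider tests
   ```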
   
   You can get even more context by reading the description of our CI system (I keep it up-to-date regularly):
   
   https://github.com/apache/airflow/blob/main/CI.rst
   
   Including the charts showing how our jobs work (and explaining some of the built-in optimisations - for example, we only build images once and re-use them across all the builds, using the super-efficient remote caching that docker buildkit added some time ago):
   
   https://github.com/apache/airflow/blob/main/CI_DIAGRAMS.md
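   
   For the image re-use part, the underlying mechanism is buildkit's registry cache - roughly the following, where the image and cache references are placeholders and the real invocation is wrapped by Breeze:
   
   ```python
   # Sketch of building the CI image once with a registry cache that subsequent
   # jobs can pull instead of rebuilding. Image/cache refs are placeholders;
   # the real commands are generated by Breeze.
   import subprocess

   IMAGE_REF = "ghcr.io/example/airflow-ci:latest"  # placeholder
   CACHE_REF = "ghcr.io/example/airflow-ci:cache"   # placeholder

   subprocess.run(
       [
           "docker", "buildx", "build",
           "--cache-from", f"type=registry,ref={CACHE_REF}",
           "--cache-to", f"type=registry,ref={CACHE_REF},mode=max",
           "--tag", IMAGE_REF,
           "--push",
           ".",
       ],
       check=True,
   )
   ```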
   
   I know it is a lot of information, but hey, you asked :). 
   
   And I would really love to have someone else look at all of this, not only me :). I am quite a SPOF for it, so if you would like to dig in and have more questions, I am happy to share everything (though it might take some time to grasp it all, I am afraid).

