potiuk commented on PR #30672: URL: https://github.com/apache/airflow/pull/30672#issuecomment-1510508451
Ok. Not sure if you know what you've asked for, but well, you did @hussein-awala :).

> I have a concern regarding the use of a lot of parallelism in our workflows. While it can greatly enhance our development speed, there is a possibility that a small number of PRs could occupy all the runners, potentially delaying the progress of other PRs. It would be beneficial to have some kind of configuration in place to prevent this from happening, such as limiting the number of runners a single PR can use. This would help ensure that all PRs have access to the necessary resources and can be processed efficiently.

Looking at the available metrics, I don't think we need to do anything about it. I regularly check the metrics we have (see below) and every time I see an optimisation opportunity, I implement it. However, I would love someone else to take a deep dive, understand our system and improve it. So if you would like to join me - I am all ears and happy to share everything about it. I think it would be best if you first try to understand how our system works in detail, and then we can discuss ways things can be improved. If you have some ideas (after getting a grasp of it) on how to improve it and what it might take, I would love to see them.

The "noisy-neighbour" issue has been an ongoing debate for the Apache Software Foundation and Airflow for years. It used to be very bad at times, but with a lot of effort on our side, with helping other projects optimise their jobs, and especially with GitHub providing more public runners and us having sponsored self-hosted runners, the issue is currently non-existent IMHO.

Just for context: all non-committers use public runners (except for image building). All our public runners run in a shared pool of 900 jobs that the ASF shares between all of its projects (there are ~700 projects in the ASF using GitHub Actions), and for Airflow we have self-hosted runners in an auto-scaling group of up to 35 runners (8 CPU, 64 GB) that can be scaled on demand - we run them all on AWS-provided credits, sponsored by Astronomer if needed when the credits run out. There is a rough sketch of that committer/non-committer routing below.

Here is one of the charts showing how many ASF projects use GitHub Actions:

<img width="1253" alt="Screenshot 2023-04-17 at 00 42 37" src="https://user-images.githubusercontent.com/595491/232347014-8454e385-5704-42d4-9e90-b893d33bfa05.png">

For the ASF, metrics and solving the "noisy-neighbour" problem have been on the table for about 3 years now and I doubt we can solve it quickly - though if you have some ideas, I am all ears. We've been discussing similar subjects at length (I will post some resources), and if you can add something to the discussion (after reading that and getting the context), it would be great. Adding some metrics that would give us more value would also be great.

Also, we have a big change ahead of us: converting our ci-infra (developed initially by @ashb based on some AWS components - DynamoDB, Auto Scaling groups and a few others) to Kubernetes. Since the time we had to implement our own custom solution, a few ways (some even supported by GitHub) to deploy it on Kubernetes have appeared, and we also have a promise from Google to get free credits on their cloud if we manage to do it as a standard Kubernetes deployment, which would increase our CI capacity greatly. Our auto-scaling infrastructure is in https://github.com/apache/airflow-ci-infra (I am happy to guide your way - it's not super-documented, but it is pretty up-to-date and the code describes what we do pretty well).
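To illustrate the committer vs. public-runner routing I mentioned above, here is a minimal, purely illustrative sketch - the function, the labels and the hard-coded committer set are made up for the example; the real decision is made inside the workflow files and is more involved:

```python
# A minimal, hypothetical sketch of the runner routing described above:
# committers' jobs go to the sponsored self-hosted pool, everybody else's
# jobs go to the shared ASF pool of public GitHub runners, and image
# building always happens on the self-hosted runners. Names, labels and the
# hard-coded committer set are illustrative only - the real logic lives in
# the workflow files.
from __future__ import annotations

COMMITTERS = {"potiuk", "ashb"}  # illustrative placeholder, not the real list


def runs_on(actor: str, job: str) -> str:
    """Return the GitHub Actions `runs-on` label to use for a given actor and job."""
    if job == "build-images":
        # Image building is the exception mentioned above - it always uses
        # the self-hosted runners.
        return "self-hosted"
    if actor in COMMITTERS:
        # Committers' PRs can use the auto-scaling self-hosted pool
        # (up to 35 runners, 8 CPU / 64 GB each).
        return "self-hosted"
    # Everyone else shares the ASF-wide pool of public GitHub runners.
    return "ubuntu-22.04"


if __name__ == "__main__":
    print(runs_on("potiuk", "tests"))                  # -> self-hosted
    print(runs_on("first-time-contributor", "tests"))  # -> ubuntu-22.04
```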
Note - I am not going to discourage you, quite the opposite - I'd love to get a helping hand from someone who would start understanding and contributing to our system. So I will add some pointers so that you can take a look.

> I am not sure if we have any metrics or analytics available to measure the average duration in queue and the overall duration of a workflow. It would be helpful to have access to this information to evaluate the impact of any changes we make to our CI workflow. By tracking the average queue duration and the total duration of a workflow, we can better understand how changes affect our development process and identify areas for improvement.

I am so glad someone else would like to look at that :). I would love to get some PRs improving the test harness we have on CI. Gathering and analysing some metrics would also be good, but I am afraid the metrics for GitHub Actions are very poor - if you could improve them, that would be great.

You can read some history of that. There are some metrics that a friend of mine and contributor @TobKed developed for the whole Apache Software Foundation in the absence of proper metrics for GitHub Actions (we have been gathering these metrics for ~2 years with this "duct-tape" solution of ours, alongside continuous promises from GitHub to provide better metrics - they are promised, but not yet available). If you send me your Gmail address, I can share access to the report (ping me on Slack).

Here is an example "workflow queued average" for all ASF projects on public runners:

<img width="1354" alt="Screenshot 2023-04-17 at 00 19 15" src="https://user-images.githubusercontent.com/595491/232345787-61c43340-72ca-43a5-9fbb-20d04b4327a9.png">

This one shows that Airflow is not even on the radar of queued workflows, because of the heavy optimisations we've implemented. It used to be much worse, and we had a lot of trouble ~3 years ago, but then I implemented plenty of optimisations and checks and @ashb implemented our self-hosted infrastructure (and we got credits to run it). Since then the ASF also went from 50 to 900 jobs sponsored by GitHub, which (at least temporarily) solved the noisy-neighbour problems we had, but we still try to use the runners in as optimised a way as possible (see below).

<img width="1301" alt="Screenshot 2023-04-17 at 00 26 17" src="https://user-images.githubusercontent.com/595491/232346157-9ff7e8bc-00f0-44b4-b1cc-20659cefe214.png">

The docs describing the status of GitHub Actions in the ASF (I try to keep them updated, though recently there have not been many updates) are here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Regarding the optimisations we have - I think you should look at the "selective checks" part. We go to GREAT lengths to make sure that each PR only runs whatever it needs to run. Basically, PRs usually run only a very small selection of tests - only those that are very likely to be affected. For example, unit tests for the Helm chart are only run when the Helm chart is modified. There are many more rules - for example, provider tests are only run for a subset of providers: only the providers that are changed and those that have cross-dependencies with the changed providers. All of them are described in https://github.com/apache/airflow/blob/main/dev/breeze/SELECTIVE_CHECKS.md (which also reminds me that I should review and update it, as there are a few more selective check rules which are not documented there).
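To give you a feel for what those rules look like, here is a deliberately simplified sketch - the function and the exact rule shapes are illustrative only; the real implementation in breeze is much richer (see the documentation and tests linked in this comment):

```python
# A deliberately simplified, hypothetical sketch of the kind of rules our
# selective checks implement. The function and the exact rule shapes are
# made up for illustration - the real implementation in breeze is richer.
from __future__ import annotations


def selected_test_types(changed_files: list[str]) -> set[str]:
    """Map the files touched by a PR to the groups of tests that should run."""
    test_types: set[str] = set()

    # Rule: Helm chart unit tests only run when chart files were modified.
    if any(f.startswith("chart/") for f in changed_files):
        test_types.add("Helm")

    # Rule: provider tests only run for the providers that were changed
    # (the real rules also add providers that cross-depend on them).
    changed_providers = {
        f.split("/")[2]
        for f in changed_files
        if f.startswith("airflow/providers/") and len(f.split("/")) > 3
    }
    if changed_providers:
        test_types.add("Providers[" + ",".join(sorted(changed_providers)) + "]")

    # Rule: changes to Airflow core trigger the core test suite.
    if any(
        f.startswith("airflow/") and not f.startswith("airflow/providers/")
        for f in changed_files
    ):
        test_types.add("Core")

    return test_types


if __name__ == "__main__":
    print(selected_test_types(["chart/values.yaml"]))
    # -> {'Helm'}
    print(selected_test_types(["airflow/providers/amazon/aws/hooks/s3.py"]))
    # -> {'Providers[amazon]'}
```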
Also, the vast majority of those selective check rules (the ones with a bit more complex logic) are nicely covered by rather comprehensive unit tests in breeze: https://github.com/apache/airflow/blob/main/dev/breeze/tests/test_selective_checks.py

You can get even more context by reading the description of our CI system (I keep it up to date regularly): https://github.com/apache/airflow/blob/main/CI.rst - including the charts showing how our jobs work and explaining some of the built-in optimisations (for example, we only build images once and re-use them across all the builds, using the super-efficient remote caching that docker buildkit added some time ago - there is a rough sketch of that at the end of this comment): https://github.com/apache/airflow/blob/main/CI_DIAGRAMS.md

I know it is a lot of information, but hey, you asked :). And I would really love to have someone else look at all of this, not only me :). I am quite a bit of a SPOF for it, so if you would like to dig in and have more questions, I am happy to share everything (though it might take some time to grasp it, I am afraid).
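PS. To make the "build images once and re-use them with remote caching" part a bit more concrete, here is a rough, hypothetical sketch of how buildkit registry caching achieves it - the image and cache refs are placeholders, and in our CI this is all driven by breeze rather than a script like this, but the idea is the same:

```python
# A rough, hypothetical sketch of the "build once, re-use everywhere" idea
# with buildkit remote (registry) caching: one job builds the CI image and
# publishes both the image and its layer cache, every other job just pulls
# the already-built image. The image/cache refs below are placeholders,
# not the real ones used by Airflow's CI.
import subprocess

IMAGE = "ghcr.io/example/airflow-ci:latest"      # hypothetical CI image ref
CACHE = "ghcr.io/example/airflow-ci:buildcache"  # hypothetical remote cache ref


def build_and_push_with_cache() -> None:
    """Run once per set of builds: build the CI image and publish the remote cache."""
    subprocess.run(
        [
            "docker", "buildx", "build",
            "--tag", IMAGE,
            # Re-use whatever the previous build left in the registry cache...
            "--cache-from", f"type=registry,ref={CACHE}",
            # ...and write the full layer cache back for subsequent builds.
            "--cache-to", f"type=registry,ref={CACHE},mode=max",
            "--push",
            ".",
        ],
        check=True,
    )


def pull_prebuilt_image() -> None:
    """Run in every test job: just pull the image built by the build job."""
    subprocess.run(["docker", "pull", IMAGE], check=True)
```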
