*Update on the workflow refactor for our CI*

*TL;DR: We should see some good speed-ups, better stability and simplifications in our CI. Read on if you are interested in the details.*

Some first passes on the new workflows have been done and small "teething" problems addressed - so far so good. Once again kudos to Jacob (CC:) who developed and published the action in https://github.com/apache/infrastructure-actions/tree/main/stash - it's been extremely useful as a better replacement for the GitHub built-in "cache" action - with no size limitations, automated retention of cache and better support for cross-fork cache.

We also worked - mostly with Pavan (kudos for all the PRs, help, reviews, and watchful eyes there!) - on some follow-ups, and there are some good things to share if you are interested in what's been improved.

*Caching improvements for venvs and pre-commits and adding more `uv`*

We doubled down on the "stash" action and now have quite comprehensive caching implemented in various stages of our builds. We used the "quiet" period around Xmas/New Year to optimize various areas where build time could be improved.

There are some interesting findings. It turned out that `uv` is SO FAST that in several cases using a cache is SLOWER (installing breeze, creating k8s test environments). The overhead of running the action, then downloading and extracting the cache from storage, is simply bigger than the time uv needs to resolve, pull and install the packages. So we do not use caching there, and appropriate comments have been added.

We reviewed all the places where `pip` was still used, applied "use-uv" wherever it had not been used yet, and added caching where it made sense. This is mostly in https://github.com/apache/airflow/pull/45289

Some numbers:

* installing pre-commits is now, in most cases (when pre-commits are not modified), down from 1m 35s -> 30s
* installing the k8s env (which is multiplied by the number of k8s tests we run in some PRs) is down from 1m -> 10s - this is particularly visible in "canary runs", where we now run 32x variants of those tests :)
* uv applied in a few missing places decreased installation of some venvs from about 1 minute to a few seconds.

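As a rough sanity check on the numbers above, here is a minimal Python sketch that turns the quoted timings into speed-up factors and an aggregate saving for the k8s canary variants. The function and constant names are made up for illustration, and the aggregate assumes each of the 32 variants installs the k8s env exactly once, which the numbers above suggest but do not state.

```python
# Timings quoted above, converted to seconds as (before, after).
TIMINGS = {
    "install pre-commits": (95, 30),  # 1m 35s -> 30s
    "install k8s env": (60, 10),      # 1m -> 10s
}

# Number of k8s test variants in a canary run, as quoted above.
K8S_VARIANTS = 32


def speedup(before: int, after: int) -> float:
    """Return how many times faster the step became."""
    return before / after


def saved_seconds(before: int, after: int, runs: int = 1) -> int:
    """Return the total seconds saved when the step runs `runs` times."""
    return (before - after) * runs


if __name__ == "__main__":
    for step, (before, after) in TIMINGS.items():
        print(f"{step}: {speedup(before, after):.1f}x faster")

    # Assumption: every variant performs the k8s env install once.
    k8s_before, k8s_after = TIMINGS["install k8s env"]
    total = saved_seconds(k8s_before, k8s_after, runs=K8S_VARIANTS)
    print(f"k8s env over {K8S_VARIANTS} variants: {total // 60}m {total % 60}s saved")
```

Under that assumption, the per-job saving of 50 seconds adds up to a bit under 27 minutes of runner time per canary run.
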
*Cache mount CI improvements*

We also managed to come up with an "eat cake and have it too" solution where we had conflicting CI and local dev needs. We are now using "--mount-cache" to keep the uv cache and speed up local Breeze CI builds, but that caused CI builds to be generally slower - 5 minutes 45s to build the image - because uv had to reinstall airflow + 700+ dependencies from scratch.

But with https://github.com/apache/airflow/pull/45314 we also used the "stash" action to cache the `uv` cache in canary builds (and we can re-use the cache in fork PRs). This way we are down to a bit more than 3 minutes to build the image. While this is not as "fast" as the previous approach using GitHub Registry and pre-caching installation from "main", the fact that we have a single workflow, no "pull_request_target" and no "wait for images", and a generally simpler approach, makes up for it altogether.

So we got: 5m 45s -> 3m 10s.

The "local" experience with building "breeze" images should be much better overall - after the first local build of an image, subsequent rebuilds, even with a lot of changes in dependencies, should be way faster. Generally, rebuilding the breeze image once it has been built locally should take seconds rather than minutes (this might be longer if a new python patchlevel has been released - but that happens once every few weeks).

Here is the build broken down into: downloading the uv cache, importing it, and building the image using it.

[image: Screenshot 2025-01-01 at 10.07.04.png]

If we had just the "build image" step without caching, it would take ~ 5m 45s - so we save roughly 45% of the CI image build time per python version per run. That's also about 2.5 minutes shorter feedback time - less waiting.

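Taking the 5m 45s and 3m 10s figures above at face value, the saving per image build can be worked out with a small Python sketch like the one below. The helper names are hypothetical, and the result covers a single python version in a single run.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xm Ys' duration into seconds."""
    return minutes * 60 + seconds


def percent_saved(before_s: int, after_s: int) -> float:
    """Return the share of the original duration that was saved, in percent."""
    return 100 * (before_s - after_s) / before_s


# Image build times quoted above.
uncached = to_seconds(5, 45)  # build without the stashed uv cache
cached = to_seconds(3, 10)    # download + import cache + build

if __name__ == "__main__":
    saved = uncached - cached
    print(f"saved {saved // 60}m {saved % 60}s per image build")
    print(f"that is {percent_saved(uncached, cached):.1f}% of the uncached build time")
```
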
*Reproducing CI failures locally*

Thanks to Pavan's https://github.com/apache/airflow/pull/45287 (further improved by https://github.com/apache/airflow/pull/45296 and https://github.com/apache/airflow/pull/45324 ) we now have an even easier way to reproduce the CI builds (if you have an AMD machine for now - that will change in the future when we have ARC enabled and ARM builds in CI).

Whenever you want to locally reproduce a failure in CI, you should be able to specify the PR# that you want to "reproduce failure of":

    breeze ci-image load --from-pr 12345 --python 3.9 --github-token TOKEN

Similarly, when you want to reproduce a specific run failure:

    breeze ci-image load --from-run 12538475388 --python 3.9 --github-token TOKEN

After that, your local image will be exactly the same as the one in CI. With the following command you can drop into a breeze shell and do some testing there, without switching to the branch of the PR:

    breeze shell --mount-sources skip [OTHER OPTIONS]

If you check out the branch of the PR that was used, regular ``breeze`` commands will also reproduce the CI environment without having to rebuild the image - for example when dependencies changed or when new dependencies were released and used in the CI job - and you will be able to edit source files locally as usual.

This is nicely described in a few relevant places in our docs - mainly here: https://github.com/apache/airflow/blob/main/dev/breeze/doc/ci/07_running_ci_locally.md - I updated and refreshed the docs, and they should be easier to understand and follow.

Jarek and Pavan

On Mon, Dec 30, 2024 at 7:17 AM Amogh Desai <amoghdesai....@gmail.com> wrote:

> Thanks Jarek for simplifying the workflows and thanks for the announcement
> too, or contributors would probably be pretty lost
> if something strange happened.
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Mon, Dec 30, 2024 at 11:26 AM Vishnu Chilukoori <vish.chiluko...@gmail.com> wrote:
>
> > Thanks Jarek, great work on simplifying and securing CI workflows!
> >
> > --
> > Regards,
> > Vishnu Chilukoori
> >
> > On Sun, Dec 29, 2024 at 2:57 PM Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
> >
> > > Woohooo Thanks Jarek Great work :)
> > >
> > > Regards,
> > > Pavan
> > >
> > > On Sun, Dec 29, 2024 at 10:15 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > Hello here,
> > > >
> > > > TL;DR; I just merged https://github.com/apache/airflow/pull/45266 -
> > > > which implemented a much simplified and nicer workflow for our CI.
> > > >
> > > > Rebase to the latest `main` and you should be good to go.
> > > >
> > > > It (finally) switches from a workflow we had for years (using pretty
> > > > dangerous from the security point of view `pull_request_target` workflow) -
> > > > into using Artifacts for sharing images in workflow. This was possible
> > > > thanks to new "artifacts" actions and switching to UV.
> > > >
> > > > The benefit of it is that it is way safer - no more "dangerous workflows"
> > > > and simpler - we have a lot simpler Dockerfile.ci and caching mechanism
> > > > implemented. We worked this out by discussing with other ASF projects and
> > > > actually even reusing an action developed by a fellow Apache Arrow
> > > > committer and PMC member - Jacob Wujciak.
> > > >
> > > > The things everyone should do:
> > > >
> > > > * rebase your PR to latest main to make your PRs rebuilt using the new
> > > > workflow
> > > > * run `breeze ci-image build` if you are using breeze locally
> > > >
> > > > I expect some teething problems, so do not hesitate to raise your problems
> > > > in #internal-airflow-ci-cd channel for CI or #airflow-breeze channel if you
> > > > see breeze problems
> > > >
> > > > Your regular workflows should continue working as usual, you should see
> > > > just one workflow in CI running builds and tests instead of two.
> > > >
> > > > J.