Ah - one more thing - I almost forgot. We could potentially save up to around 5% of compute if we delayed running heavy jobs (tests, for example), but this would increase elapsed time. Around 12% of the time used goes to PR runs that get cancelled by a new push to the same PR while they are running. I doubt we can gain a lot here, because people usually push a new version of a PR when they see the first tests failing and realise they need to make a fix-up, and by then it is already too late - the tests are already running in parallel and using compute. So that one is not really feasible for significant gains, and it would increase elapsed time.

On Mon, Jun 22, 2026 at 1:33 AM Jarek Potiuk <[email protected]> wrote:

> BTW. Vikram, your accidental triggering of a full test build occurred for
> various reasons in 3.3% of all CI runs - with 3.6% impact on the compute
> time. So if you or others were afraid we were burning money, using power
> and wasting water, that was totally unfounded. This was - as I
> explained - anecdotal experience, not something that happened consistently.
> Even if the impact is small, I added additional protection for similar
> cases in #68802.
>
> J.
>
>
> On Mon, Jun 22, 2026 at 1:22 AM Jarek Potiuk <[email protected]> wrote:
>
>> Hi all,
>>
>> This afternoon, I worked with my agents to conduct a detailed analysis of
>> our CI compute and elapsed times over the past two weeks. I want to share
>> the findings, a handful of PRs, and - more importantly - set realistic (and
>> data-backed) expectations about what will and won't move the needle.
>>
>> This is a long one, but since there were just **feelings** that things
>> are different than they are, I wanted to clarify it with precise data
>> gathered over the last 2 weeks.
>>
>> I think discussions on the dev call where there is no way to look at
>> details and drag data and where we mostly base any statements on just
>> wishful thinking - and where anecdotal experiences might not match
>> statistical reality - are unproductive.
>>
>> I think we should base more of our discussions on data and not "because
>> we think things are different".
>>
>> Full write-up (numbers, tables, methodology):
>> https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26
>>
>>
>>
>> *== Selective checks already save a LOT ==*
>> The headline is that our selective-checks machinery is already doing an
>> enormous amount of work. Measured over two weeks:
>>
>> - vs running full tests for every PR: ~60% fewer compute-hours
>> - vs the complete all-versions matrix (canary): ~81% fewer
>>
>> So PRs already run at roughly 19-40% of "run everything."
>> The system works, and it works well. I want to be clear about that before
>> anyone reads the optimizations as "CI is broken" - it isn't.
>>
>>
>> *== The optimizations are small and incremental - with one exception ==*
>> The one genuinely large item is not really an optimization, it's fixing a
>> mistake merged 2 weeks ago (!): generated/provider_dependencies.json was
>> accidentally re-added to git tracking ~2 weeks ago together with the
>> *ClickHouse provider* (even though provider_dependencies.json + shasum
>> were gitignored - they were forcefully added).
>>
>> Because it was tracked again, every provider-dependency regeneration
>> looked like a dependency change and forced the full all-Python-versions
>> matrix - ~2,700 compute-hours / 2 weeks for nothing. Removing it (#68801)
>> should also fix some of the issues that a few of us had when the generated
>> and committed file was stale - that was one of the reasons I removed it in
>> April 2025 when we switched our infra to uv. That's almost 12% less compute.
>>
>> Everything else is genuinely small and incremental, yielding only a few
>> percent each:
>>
>> - #68533 (merged) - standard venv-operator tests made DB-free so they
>>   parallelize
>> - #68802 - stop non-test workflows + prek-only changes forcing the full
>>   matrix
>> - #68814 - rebalance the provider test groups (split the serial monolith)
>> - #68821 - reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one
>>   is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it
>>   makes our investigations potentially harder and our reaction time to main
>>   regressions slower - a regression can sit undetected up to ~12h instead of
>>   ~6h. We should only take it if we consciously accept that.
>>
>> Combined these are worth ~20% of AMD compute - meaningful, but
>> incremental, and the canary one trades coverage/latency for it.
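
The "~12h instead of ~6h" figure for #68821 can be checked with simple arithmetic. This is only a rough sketch: it assumes canary runs are evenly spaced across the day and ignores queue wait and run duration (neither assumption comes from the analysis), and `worst_case_gap_hours` is a hypothetical helper name, not something from the Airflow repository.

```python
def worst_case_gap_hours(runs_per_day: int) -> float:
    """Longest wait until the next scheduled canary run.

    Assumes runs are evenly spaced over 24 hours (an assumption made
    for illustration only).
    """
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return 24 / runs_per_day


# Cadences quoted in the email: AMD 4 -> 2 per day, ARM 8 -> 2 per day.
for label, before, after in (("AMD", 4, 2), ("ARM", 8, 2)):
    print(
        f"{label}: up to {worst_case_gap_hours(before):.0f}h before, "
        f"up to {worst_case_gap_hours(after):.0f}h after"
    )
```

Under these assumptions the AMD numbers match the ~6h and ~12h quoted above. The real detection delay would also include runner queue time and the duration of the canary run itself.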
>>
>>
>> *== The real bottleneck is the shared ASF public runner queue, not our tests ==*
>> Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a
>> runner*, and jobs in big runs wait ~2.5x longer than in small ones because
>> each big run floods the shared pool. No amount of our own test-trimming
>> changes this fundamentally - it's a capacity/contention problem, not a "we
>> test too much" problem.
>>
>> And the important reality check, based on history: the only thing that
>> changes the contributor experience *dramatically* - historically a ~4x
>> speedup in elapsed time - is more powerful hardware OUTSIDE the shared ASF
>> public runner queue. Our incremental optimizations help at the margins;
>> dedicated hardware is the step change.
>>
>>
>> *== Why "complaining" to Infra is the wrong move ==*
>> We have been here before - more than 5 years ago, when Infra had far fewer
>> runners, we went through a very similar exercise. The structural issues
>> haven't changed:
>>
>> - The public runner pool is SHARED across all Apache projects. Other
>>   projects may or may not optimize their tests the way we do - we can't
>>   control that.
>> - Everyone's traffic is up sharply because of AI-generated PRs, so
>>   the contention is systemic, not Airflow-specific.
>> - Infra has no real mechanism to fix this for us other than what they
>>   have always recommended: a PMC arranging its own dedicated, self-hosted
>>   runners. If you look at the discussion we had in October 2020
>>   https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl -
>>   this was the only solution that we could apply back then, and it worked.
>>
>>
>> *Vikram* - if you want to follow up on writing to them, that's fine, but
>> the message has to recognize the reality we're in. Last time, the thing
>> that actually worked was GitHub giving us ~3x more runners.
>> That is unlikely to repeat: GitHub themselves now have to scale their own
>> infrastructure 20-30x for the same AI-driven reasons, so a "please give us
>> 3x again" ask is not realistic. If we write, it has to be grounded in that.
>>
>> *Shahar* - I think your exploration of Kubernetes self-hosted runners on
>> cheaper spot instances is the most promising path by far. That is the lever
>> that gives us the ~4x, on our own terms, without depending on the shared
>> pool or on Infra's capacity. It is also one where we can easily control
>> which runs should be faster (canary runs + committer runs seem like a good
>> idea), and it is the only one that is *100%* resilient to the external
>> AI-generated flood. I'd strongly suggest continuing it - it's the fastest
>> way to actually improve our contributors' experience.
>>
>> == Next steps ==
>>
>> Everyone - please review/approve those PRs and let's merge them. They
>> will help "a little," but won't solve the underlying runner contention:
>>
>> * https://github.com/apache/airflow/pull/68801
>> * https://github.com/apache/airflow/pull/68802
>> * https://github.com/apache/airflow/pull/68814
>> * https://github.com/apache/airflow/pull/68821
>>
>> Happy to discuss any of the above (based on data and analysis - ideally).
>>
>> Also, if others have concrete ideas on **what** we can improve - such as
>> which tests we might skip under certain circumstances - all concrete ideas
>> are welcome. The Gist above has a lot of data.
>>
>> But we should all realise that pretty much everything here is a trade-off
>> - usually we trade elapsed time and compute time for greater certainty that
>> a merge will not break all other PRs.
>>
>> Best,
>> Jarek
>>
>
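
For reference, the "roughly 19-40% of run everything" range quoted in the selective-checks section follows directly from the two savings figures (~60% and ~81% fewer compute-hours). A minimal sketch of that arithmetic, where `remaining_share` is a hypothetical helper introduced only for illustration:

```python
def remaining_share(saving_pct: float) -> float:
    """Share of the baseline compute still used after a given saving, in percent."""
    if not 0 <= saving_pct <= 100:
        raise ValueError("saving_pct must be between 0 and 100")
    return 100.0 - saving_pct


# Savings quoted in the email, each measured against a different baseline.
baselines = {
    "full tests for every PR": 60.0,
    "complete all-versions matrix (canary)": 81.0,
}

for name, saving in baselines.items():
    print(f"vs {name}: PRs use ~{remaining_share(saving):.0f}% of that baseline")
```

The two ends of the range are measured against different baselines, so it reads as "19% of the canary matrix" up to "40% of full tests on every PR", not as a spread within a single measurement.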
