Hi all,

This afternoon I worked with my agents on a detailed analysis of our CI compute and elapsed times over the past two weeks. I want to share the findings and a handful of PRs and, more importantly, set realistic, data-backed expectations about what will and won't move the needle.
This is a long one, but since there were **feelings** that things are different than they actually are, I wanted to clarify with precise data gathered over the last two weeks. I think discussions on the dev call, where there is no way to look at details and dig into the data, where we mostly base statements on wishful thinking, and where anecdotal experience might not match statistical reality, are unproductive. We should base more of our discussions on data and not on "because we think things are different".

Full write-up (numbers, tables, methodology): https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26

*== Selective checks already save a LOT ==*

The headline is that our selective-checks machinery is already doing an enormous amount of work. Measured over two weeks:

- vs running full tests for every PR: ~60% fewer compute-hours
- vs the complete all-versions matrix (canary): ~81% fewer compute-hours

So PRs already run at roughly 19-40% of "run everything." The system works, and it works well. I want to be clear about that before anyone reads the optimizations as "CI is broken". It isn't.

*== The optimizations are small and incremental - with one exception ==*

The one genuinely large item is not really an optimization; it fixes a mistake merged 2 weeks ago (!): generated/provider_dependencies.json was accidentally re-added to git tracking together with the *ClickHouse* provider (even though provider_dependencies.json and its shasum were gitignored, they were force-added). Because the file was tracked again, every provider-dependency regeneration looked like a dependency change and forced the full all-Python-versions matrix: ~2,700 compute-hours over two weeks for nothing. Removing it (#68801) should also fix some of the issues a few of us had when the generated and committed file was stale, which was one of the reasons I removed it in April 2025 when we switched our infra to uv. Removing it saves almost 12% of compute.
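The percentages quoted above can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic, not part of the methodology in the gist. The helper names are made up for illustration, and the implied two-week total is an inference from "~2,700 compute-hours" and "almost 12%" (assuming the 12% refers to total compute), not a figure taken from the write-up.

```python
def remaining_share(savings_pct: float) -> float:
    """Percent of the 'run everything' baseline still used after the savings."""
    return 100.0 - savings_pct


def implied_total_hours(wasted_hours: float, wasted_share_pct: float) -> float:
    """Total compute-hours implied by a wasted amount and its share of the total.

    Assumption: wasted_share_pct is a share of ALL compute in the same period.
    """
    return wasted_hours * 100.0 / wasted_share_pct


# Selective checks: ~60% and ~81% fewer compute-hours than the two baselines.
vs_full_tests = remaining_share(60)   # share left vs full tests on every PR
vs_canary = remaining_share(81)       # share left vs the all-versions matrix

# Tracked provider_dependencies.json: ~2,700 hours, quoted as almost 12%.
total_two_weeks = implied_total_hours(2700, 12)

print(f"PRs run at roughly {vs_canary:.0f}-{vs_full_tests:.0f}% of 'run everything'")
print(f"Implied total over two weeks: ~{total_two_weeks:,.0f} compute-hours")
```

Since both inputs to the second calculation are rounded, the implied total is an order-of-magnitude figure only. The precise numbers are in the gist.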
Everything else is genuinely small and incremental, yielding only a few percent each:

- #68533 (merged): standard venv-operator tests made DB-free so they parallelize
- #68802: stop non-test workflows and prek-only changes from forcing the full matrix
- #68814: rebalance the provider test groups (split the serial monolith)
- #68821: reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it makes our investigations potentially harder and our reaction time to main regressions slower - a regression can sit undetected for up to ~12h instead of ~6h. We should only take it if we consciously accept that.

Combined, these are worth ~20% of AMD compute: meaningful, but incremental, and the canary one trades coverage/latency for it.

*== The real bottleneck is the shared ASF public runner queue, not our tests ==*

Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a runner*, and jobs in big runs wait ~2.5x longer than jobs in small ones because each big run floods the shared pool. No amount of our own test-trimming changes this fundamentally: it's a capacity/contention problem, not a "we test too much" problem.

And the important reality check, based on history: the only thing that changes the contributor experience *dramatically* (historically a ~4x speedup in elapsed time) is more powerful hardware OUTSIDE the shared ASF public runner queue. Our incremental optimizations help at the margins; dedicated hardware is the step change.

*== Why "complaining" to Infra is the wrong move ==*

We have been here before: more than 5 years ago, when Infra had far fewer runners, we went through a very similar exercise. The structural issues haven't changed:

- The public runner pool is SHARED across all Apache projects. Other projects may or may not optimize their tests the way we do, and we can't control that.
- Everyone's traffic is up sharply because of AI-generated PRs, so the contention is systemic, not Airflow-specific.
- Infra has no real mechanism to fix this for us other than what they have always recommended: a PMC arranging its own dedicated, self-hosted runners. If you look at the discussion we had in October 2020 (https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl), this was the only solution we could apply back then, and it worked.

*Vikram* - if you want to follow up on writing to them, that's fine, but the message has to recognize the reality we're in. Last time, the thing that actually worked was GitHub giving us ~3x more runners. That is unlikely to repeat: GitHub themselves now have to scale their own infrastructure 20-30x for the same AI-driven reasons, so a "please give us 3x again" ask is not realistic. If we write, it has to be grounded in that.

*Shahar* - I think your exploration of Kubernetes self-hosted runners on cheaper spot instances is the most promising path by far. That is the lever that gives us the ~4x, on our own terms, without depending on the shared pool or on Infra's capacity. It also lets us easily control which runs should be faster (canary runs plus committer runs seem like a good idea), and it is the only option that is *100%* resilient to the external AI-generated flood. I'd strongly suggest continuing it: it's the fastest way to actually improve our contributors' experience.

*== Next steps ==*

Everyone - please review/approve these PRs and let's merge them. They will help "a little," but won't solve the underlying runner contention:

* https://github.com/apache/airflow/pull/68801
* https://github.com/apache/airflow/pull/68802
* https://github.com/apache/airflow/pull/68814
* https://github.com/apache/airflow/pull/68821

Happy to discuss any of the above (ideally based on data and analysis).
Also, if others have concrete ideas on **what** we can improve, such as which tests we might skip under certain circumstances, they are all welcome. The Gist above has a lot of data. But we should all realise that pretty much everything here is a trade-off: usually we trade elapsed time and compute time for greater certainty that a merge will not break all other PRs.

Best,
Jarek
