Hi all,

This afternoon, I worked with my agents to conduct a detailed analysis of
our CI compute and elapsed times over the past two weeks. I want to share
the findings, a handful of PRs, and — more importantly — set realistic (and
data-backed) expectations about what will and won't move the needle.

This is a long one, but since so far there were only **feelings** that things
are different than they are, I wanted to clarify it with precise data
gathered over the last 2 weeks.

I think discussions on the dev call are unproductive when there is no way
to look at details or pull up data, and when statements rest mostly on
wishful thinking and on anecdotal experiences that might not match
statistical reality.

I think we should base more of our discussions on data and not "because we
think things are different".

Full write-up (numbers, tables, methodology):
https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26



*== Selective checks already save a LOT ==*
The headline is that our selective-checks machinery is already doing an
enormous amount of work. Measured over two weeks:

- vs running full tests for every PR: ~60% fewer compute-hours
- vs the complete all-versions matrix (canary): ~81% fewer

So PRs already run at roughly 19-40% of "run everything." The system works,
and it works well. I want to be clear about that before anyone reads the
optimizations as "CI is broken" — it isn't.
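
As a quick sanity check on that range, here is a minimal sketch that converts
the two measured savings into the share of "run everything" that PRs still
consume. The function name is hypothetical and only illustrative; the only
inputs are the two approximate percentages quoted above.

```python
def remaining_fraction(savings_pct: float) -> float:
    """Return the share of baseline compute still used after a saving."""
    if not 0 <= savings_pct <= 100:
        raise ValueError("savings_pct must be between 0 and 100")
    return (100 - savings_pct) / 100


# Approximate savings quoted above, measured over two weeks.
vs_full_tests = remaining_fraction(60)  # vs. full tests for every PR
vs_canary = remaining_fraction(81)      # vs. the complete all-versions matrix

print(f"PRs use ~{vs_canary:.0%} to ~{vs_full_tests:.0%} of 'run everything'")
```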


*== The optimizations are small and incremental — with one exception ==*
The one genuinely large item is not really an optimization, it's fixing a
mistake merged ~2 weeks ago (!): generated/provider_dependencies.json was
accidentally re-added to git tracking together with the *ClickHouse
provider* (provider_dependencies.json and its shasum were gitignored, but
they were force-added anyway).

Because it was tracked again, every provider-dependency regeneration looked
like a dependency change and forced the full all-Python-versions matrix:
~2,700 compute-hours / 2 weeks for nothing. Removing it (#68801) cuts almost
12% of compute and should also fix some of the issues that a few of us had
when the generated and committed file was stale. That was one of the reasons
I removed it in April 2025 when we switched our infra to uv.

Everything else is genuinely small and incremental, yielding only a few
percent each:

- #68533 (merged) — standard venv-operator tests made DB-free so they
parallelize
- #68802 — stop non-test workflows + prek-only changes forcing the full
matrix
- #68814 — rebalance the provider test groups (split the serial monolith)

- #68821 — reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one
is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it
makes our investigations potentially harder and our reaction time to main
regressions slower — a regression can sit undetected up to ~12h instead of
~6h. We should only take it if we consciously accept that.
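
The latency side of that trade-off is simple arithmetic. Here is a rough
sketch, assuming canary runs are evenly spaced over 24 hours and ignoring how
long a run itself takes (both are my simplifications, and the function name
is hypothetical):

```python
def worst_case_detection_hours(runs_per_day: int) -> float:
    """Longest gap between evenly spaced canary runs, in hours."""
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return 24 / runs_per_day


# Canary frequencies from #68821: (runs/day before, runs/day after).
schedule_change = {"AMD": (4, 2), "ARM": (8, 2)}

for arch, (before, after) in schedule_change.items():
    old_gap = worst_case_detection_hours(before)
    new_gap = worst_case_detection_hours(after)
    print(f"{arch}: worst-case gap ~{old_gap:g}h -> ~{new_gap:g}h")
```

Under those assumptions, the AMD change matches the ~6h to ~12h figure above.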

Combined, these are worth ~20% of AMD compute: meaningful, but incremental,
and the canary one trades coverage/latency for it.


*== The real bottleneck is the shared ASF public runner queue, not our tests ==*
Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a
runner*, and jobs in big runs wait ~2.5x longer than in small ones because
each big run floods the shared pool. No amount of our own test-trimming
changes this fundamentally — it's a capacity/contention problem, not a "we
test too much" problem.

And the important reality check, based on history: the only thing that
changes the contributor experience *dramatically* — historically a ~4x
speedup in elapsed time — is more powerful hardware OUTSIDE the shared ASF
public runner queue. Our incremental optimizations help at the margins;
dedicated hardware is the step change.


*== Why "complaining" to Infra is the wrong move ==*
We have been here before. More than 5 years ago, when Infra had far fewer
runners, we went through a very similar exercise. The structural issues
haven't changed:

   - The public runner pool is SHARED across all Apache projects. Other
   projects may or may not optimize their tests the way we do — we can't
   control that.
   - Everyone's traffic is up sharply because of AI-generated PRs, so the
   contention is systemic, not Airflow-specific.
   - Infra has no real mechanism to fix this for us other than what they
   have always recommended: a PMC arranging its own dedicated, self-hosted
   runners. See the discussion we had in October 2020:
   https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl
   That was the only solution that we could apply back then, and it worked.


*Vikram* — if you want to follow up on writing to them, that's fine, but
the message has to recognize the reality we're in. Last time, the thing
that actually worked was GitHub giving us ~3x more runners. That is
unlikely to repeat: GitHub themselves now have to scale their own
infrastructure 20-30x for the same AI-driven reasons, so a "please give us
3x again" ask is not realistic. If we write, it has to be grounded in that.

*Shahar* — I think your exploration of Kubernetes self-hosted runners on
cheaper spot instances is the most promising path by far. That is the lever
that gives us the ~4x, on our own terms, without depending on the shared
pool or on Infra's capacity. It also lets us easily control which runs
should be faster (canary runs + committer runs seem like a good idea), and
it is the only option that is *100%* resilient to an external AI-generated
flood. I'd strongly suggest continuing it: it's the fastest way to actually
improve our contributors' experience.

*== Next steps ==*

Everyone - please review/approve those PRs and let's merge them. They will
help "a little," but won't solve the underlying runner contention:

* https://github.com/apache/airflow/pull/68801
* https://github.com/apache/airflow/pull/68802
* https://github.com/apache/airflow/pull/68814
* https://github.com/apache/airflow/pull/68821

Happy to discuss any of the above (ideally based on data and analysis).

Also, if others have concrete ideas on **what** we can improve, such as
which tests we might skip under certain circumstances, they are all
welcome. The Gist above has a lot of data.

But we should all realise that pretty much everything here is a trade-off -
usually we trade elapsed time and compute time for greater certainty that a
merge will not break all other PRs.

Best,
Jarek
