Ah - one more thing - I almost forgot. We could potentially save up to
around 5% of compute if we delay running heavy jobs (tests, for example),
but this increases elapsed time. Around 12% of the time used goes to PR runs
that get cancelled by a new push to the same PR while they are running. I
doubt we can gain a lot here, because people usually push a new version of
a PR when they see the first tests failing and realise they need to make a
fix-up, and by then it's already too late - the tests are already running in
parallel and using compute. So this one is not really feasible as a source
of significant gains, and it would increase elapsed time.
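
For anyone who wants to sanity-check the headline figures in my earlier
message quoted below, here is a minimal sketch. The helper names are made up
for illustration, and the canary calculation assumes runs are evenly spaced
over the day, which the real schedule may not be.

```python
def remaining_share(percent_saved: float) -> float:
    """Percent of the baseline compute still used after saving `percent_saved`."""
    return 100.0 - percent_saved


def worst_case_detection_hours(canary_runs_per_day: int) -> float:
    """Longest gap between canary runs in hours (assumes evenly spaced runs)."""
    return 24 / canary_runs_per_day


# Selective checks: ~60% fewer compute-hours than full tests on every PR,
# ~81% fewer than the complete all-versions (canary) matrix.
share_vs_full_tests = remaining_share(60)
share_vs_full_matrix = remaining_share(81)

# AMD canary frequency going from 4 runs/day to 2 runs/day.
amd_gap_before = worst_case_detection_hours(4)
amd_gap_after = worst_case_detection_hours(2)

print(f"PRs run at {share_vs_full_matrix:.0f}-{share_vs_full_tests:.0f}% of 'run everything'")
print(f"AMD regression can sit undetected up to ~{amd_gap_after:.0f}h instead of ~{amd_gap_before:.0f}h")
```

This reproduces the 19-40% range and the ~12h vs ~6h detection window; it is
only arithmetic on the numbers already stated, not a model of the CI itself.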

On Mon, Jun 22, 2026 at 1:33 AM Jarek Potiuk <[email protected]> wrote:

> BTW. Vikram, the accidental triggering of a full test build that you hit
> occurred, for various reasons, in 3.3% of all CI runs - with a 3.6% impact
> on the compute time. So if you or others were afraid we were burning money,
> using power and wasting water, that was totally unfounded. This was - as I
> explained - anecdotal experience, not something that happened consistently.
> Even if the impact is small, I added additional protection for similar
> cases in #68802.
>
> J.
>
>
> On Mon, Jun 22, 2026 at 1:22 AM Jarek Potiuk <[email protected]> wrote:
>
>> Hi all,
>>
>> This afternoon, I worked with my agents to conduct a detailed analysis of
>> our CI compute and elapsed times over the past two weeks. I want to share
>> the findings, a handful of PRs, and — more importantly — set realistic (and
>> data-backed) expectations about what will and won't move the needle.
>>
>> This is a long one, but since there were just **feelings** that things
>> are different from how they really are, I wanted to clarify it with precise
>> data gathered over the last 2 weeks.
>>
>> I think discussions on the dev call - where there is no way to look at
>> details and dig into data, where we mostly base statements on wishful
>> thinking, and where anecdotal experiences might not match statistical
>> reality - are unproductive.
>>
>> I think we should base more of our discussions on data and not "because
>> we think things are different".
>>
>> Full write-up (numbers, tables, methodology):
>> https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26
>>
>>
>>
>> *== Selective checks already save a LOT ==*
>> The headline is that our selective-checks machinery is already doing an
>> enormous amount of work. Measured over two weeks:
>>
>> - vs running full tests for every PR: ~60% fewer compute-hours
>> - vs the complete all-versions matrix (canary): ~81% fewer
>>
>> So PRs already run at roughly 19-40% of "run everything." The system
>> works, and it works well. I want to be clear about that before anyone reads
>> the optimizations as "CI is broken" — it isn't.
>>
>>
>> *== The optimizations are small and incremental — with one exception ==*
>> The one genuinely large item is not really an optimization, it's fixing a
>> mistake (!): generated/provider_dependencies.json was accidentally re-added
>> to git tracking ~2 weeks ago together with the *ClickHouse provider* (even
>> though provider_dependencies.json + shasum were gitignored, they were
>> force-added).
>>
>> Because it was tracked again, every provider-dependency regeneration
>> looked like a dependency change and forced the full all-Python-versions
>> matrix - ~2,700 compute-hours / 2 weeks for nothing. Removing it (#68801)
>> saves almost 12% of compute and should also fix some of the issues that a
>> few of us had when the generated and committed file was stale - that was
>> one of the reasons I removed it in April 2025 when we switched our infra
>> to uv.
>>
>> Everything else is genuinely small and incremental, yielding only a few
>> percent each:
>>
>> - #68533 (merged) — standard venv-operator tests made DB-free so they
>> parallelize
>> - #68802 — stop non-test workflows + prek-only changes forcing the full
>> matrix
>> - #68814 — rebalance the provider test groups (split the serial monolith)
>>
>> - #68821 — reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one
>> is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it
>> makes our investigations potentially harder and our reaction time to main
>> regressions slower — a regression can sit undetected up to ~12h instead of
>> ~6h. We should only take it if we consciously accept that.
>>
>> Combined, these are worth ~20% of AMD compute - meaningful, but
>> incremental, and the canary one trades coverage/latency for it.
>>
>>
>> *== The real bottleneck is the shared ASF public runner queue, not our
>> tests ==*
>> Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a
>> runner*, and jobs in big runs wait ~2.5x longer than in small ones because
>> each big run floods the shared pool. No amount of our own test-trimming
>> changes this fundamentally — it's a capacity/contention problem, not a "we
>> test too much" problem.
>>
>> And the important reality check, based on history: the only thing that
>> changes the contributor experience *dramatically* — historically a ~4x
>> speedup in elapsed time — is more powerful hardware OUTSIDE the shared ASF
>> public runner queue. Our incremental optimizations help at the margins;
>> dedicated hardware is the step change.
>>
>>
>> *== Why "complaining" to Infra is the wrong move ==*
>> We have been here before—more than 5 years ago, when Infra had far fewer
>> runners, we went through a very similar exercise. The structural issues
>> haven't changed:
>>
>>    - The public runner pool is SHARED across all Apache projects. Other
>>    projects may or may not optimize their tests the way we do — we can't
>>    control that.
>>    - Everyone's traffic is up sharply because of AI-generated PRs, so
>>    the contention is systemic, not Airflow-specific.
>>    - Infra has no real mechanism to fix this for us other than what they
>>    have always recommended: a PMC arranging its own dedicated, self-hosted
>>    runners. If you look at the discussion we had in October 2020
>>    https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl -
>>    this was the only solution that we could apply back then, and it worked.
>>
>>
>> *Vikram* — if you want to follow up on writing to them, that's fine, but
>> the message has to recognize the reality we're in. Last time, the thing
>> that actually worked was GitHub giving us ~3x more runners. That is
>> unlikely to repeat: GitHub themselves now have to scale their own
>> infrastructure 20-30x for the same AI-driven reasons, so a "please give us
>> 3x again" ask is not realistic. If we write, it has to be grounded in that.
>>
>> *Shahar* - I think your exploration of Kubernetes self-hosted runners on
>> cheaper spot instances is the most promising path by far. That is the lever
>> that gives us the ~4x, on our own terms, without depending on the shared
>> pool or on Infra's capacity. It also lets us easily control which runs
>> should be faster (canary runs + committer runs seem like a good idea), and
>> it is the only option that is *100%* resilient to the external flood of
>> AI-generated PRs. I'd strongly suggest continuing it - it's the fastest way
>> to actually improve our contributors' experience.
>>
>> *== Next steps ==*
>>
>> Everyone - please review/approve those PRs and let's merge them. They
>> will help "a little," but won't solve the underlying runner contention:
>>
>> * https://github.com/apache/airflow/pull/68801
>> * https://github.com/apache/airflow/pull/68802
>> * https://github.com/apache/airflow/pull/68814
>> * https://github.com/apache/airflow/pull/68821
>>
>> Happy to discuss any of the above (based on data and analysis - ideally).
>>
>> Also, if others have concrete ideas on **what** we can improve - such as
>> which tests we might skip under certain circumstances - they are all
>> welcome. The Gist above has a lot of data.
>>
>> But we should all realise that pretty much everything here is a trade-off
>> - usually we trade elapsed time and compute time for greater certainty that
>> a merge will not break all other PRs.
>>
>> Best,
>> Jarek
>>
>
