Re: [DISCUSS] How can we afford CI for the new release cadence?

Tian Gao via dev Fri, 22 May 2026 11:47:05 -0700

As I mentioned a few weeks ago, we can't afford this. We received a
warning from ASF today and took a quick look at our CI usage.


We are using about 350k min/week now, and the limit is 250k min/week. The
post merge itself took 180k+ min/week because now we have 2 active dev
branches.
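
For a rough sense of scale, the figures above can be put into a short
calculation. This is only a sketch: the three constants are the numbers
quoted in this message, the variable names are made up for illustration,
and "180k+" is treated as a lower bound.

```python
# Figures quoted in this message, in runner minutes per week.
WEEKLY_LIMIT = 250_000
CURRENT_USAGE = 350_000
POST_MERGE_USAGE = 180_000  # quoted as "180k+", so a lower bound

# How far over the limit we are, in absolute and relative terms.
overage = CURRENT_USAGE - WEEKLY_LIMIT
required_cut = overage / CURRENT_USAGE

# Everything that is not post merge (scheduled runs, PR checks, etc.).
other_usage = CURRENT_USAGE - POST_MERGE_USAGE

# Budget left for everything else if post merge stays untouched.
budget_for_other = WEEKLY_LIMIT - POST_MERGE_USAGE

print(f"over the limit by {overage:,} min/week ({required_cut:.1%} of usage)")
print(f"non-post-merge usage: {other_usage:,} min/week")
print(f"budget left for it if post merge is kept: {budget_for_other:,} min/week")
```

Under these assumptions, keeping post merge as it is would mean cutting
everything else from roughly 170k to at most 70k min/week.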

I think we should put some effort into this. There are a few ways to make
the situation better:

1. Run fewer tests - We disabled pandas on spark tests for post merge a
while ago to comply with the ASF limit.
2. Make tests run faster - I occasionally optimize Python tests, but I'm
not sure whether the Java tests are being taken care of. Java tests now
take significantly more time in our CI.
3. Run tests less frequently - helpful for scheduled CI, where we have
already done this, but it won't help post merge.
4. Smart testing - this is a bit tricky for post-merge because ideally we
want full coverage for each commit. We can probably apply some safe
heuristics, but that takes time and we could potentially lose coverage.
5. Move scheduled tests to another repo - Apache Arrow seems to be doing
this. This allows us to use all the ASF budget to run post-merge tests.
However, we need a sponsor to achieve this.
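
To make item 4 more concrete, here is a minimal sketch of one possible
"safe heuristic", based on the observation quoted further down in this
thread that test-only changes do not need to cascade to dependent modules.
It does not use the real dev/sparktestsupport/modules.py API: the Module
class, the three-module graph, the path prefixes and modules_to_test are
all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    source_prefixes: tuple
    test_prefixes: tuple
    dependents: list = field(default_factory=list)  # downstream module names


# Hypothetical three-module graph; not the real Spark module definitions.
MODULES = [
    Module("core", ("core/src/main/",), ("core/src/test/",), ["sql"]),
    Module("sql", ("sql/src/main/",), ("sql/src/test/",), ["pyspark-sql"]),
    Module("pyspark-sql", ("python/pyspark/sql/",), ("python/pyspark/sql/tests/",)),
]


def modules_to_test(changed_files, modules=MODULES):
    """Return the sorted names of the modules whose tests should run."""
    by_name = {m.name: m for m in modules}
    selected, cascaded = set(), set()
    for path in changed_files:
        owner = next(
            (m for m in modules
             if path.startswith(m.test_prefixes + m.source_prefixes)),
            None,
        )
        if owner is None:
            # Unknown file (build config, docs, ...): stay safe, run everything.
            return sorted(by_name)
        if path.startswith(owner.test_prefixes):
            # Test-only change: no production code imports test code,
            # so only the owning module is selected.
            selected.add(owner.name)
            continue
        # Source change: walk all downstream modules.
        stack = [owner.name]
        while stack:
            name = stack.pop()
            if name not in cascaded:
                cascaded.add(name)
                stack.extend(by_name[name].dependents)
    return sorted(selected | cascaded)


print(modules_to_test(["core/src/test/FooSuite.scala"]))
print(modules_to_test(["core/src/main/Foo.scala"]))
```

The fallback to "run everything" for unrecognized files is what keeps the
heuristic conservative. The coverage risk mentioned in item 4 comes from
any case where the prefix classification is wrong, for example shared test
utilities that other modules do import.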

I think we have 2 weeks to at least temporarily reduce our CI usage under
the limit, so we need something fast, then something good.

Tian

On Mon, May 11, 2026 at 3:14 AM Akira Ajisaka <[email protected]> wrote:

> > I'm working on fixing branch-3.5 CI:
> https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
> this week.
>
> Closed the above PR as a duplicate of
> https://github.com/apache/spark/pull/55432. Sorry for the confusion.
>
> On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]> wrote:
> >
> > > Also on the 3.5 side the CI is super broken so I’m trying to fix it up
> now, the timing is complicated by the Ubuntu PPA DDoS outages.
> >
> > I'm working on fixing branch-3.5 CI:
> > https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
> > this week. The Ubuntu outage seems unrelated.
> >
> > Anyway, I'm +1 to reduce the frequency on non-active branches.
> >
> > Thanks,
> > Akira
> >
> > On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]>
> wrote:
> > >
> > > Yeah I'm not surprised that 3.5 is not in its best shape at this point
> because we almost did not run tests on it. When we reduce the coverage for
> a branch, we will have issues when we try to release. That's why we should
> not only make efforts on that side. We should explore all different ways to
> make CI better.
> > >
> > > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]>
> wrote:
> > >>
> > >> Smarter test selection is probably the magic but it’s going to be
> effort. Also on the 3.5 side the CI is super broken so I’m trying to fix it
> up now, the timing is complicated by the Ubuntu PPA DDoS outages.
> > >>
> > >>
> > >> Twitter: https://twitter.com/holdenkarau
> > >> Fight Health Insurance: https://www.fighthealthinsurance.com/
> > >> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> > >> Pronouns: she/her
> > >>
> > >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <
> [email protected]> wrote:
> > >>>
> > >>> I definitely agree that we can save a lot of time by optimizing the
> CI. But currently, java tests take more time than python tests. They are
> comparable but java tests are still observably more expensive. We should
> not only focus on python ones.
> > >>>
> > >>> In the meantime, I'll take a look at low-hanging fruit on CI to
> make it smarter.
> > >>>
> > >>> Tian
> > >>>
> > >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]>
> wrote:
> > >>>>
> > >>>> I also did some data analysis, and think we should also revisit the
> CI:
> > >>>> 1, Deduplicate the compile. For example, the pyspark matrix
> executes 8 byte-identical SBT compiles in parallel today, costing ~108m of
> redundant work per run.
> > >>>>    (I am working on a POC:
> https://github.com/apache/spark/pull/55726)
> > >>>> 2, Smarter test selection. 11% of the most recent 10,000 commits are
> test-only changes. Today these trigger the full pyspark matrix because the
> dependency
> > >>>>    graph in dev/sparktestsupport/modules.py cascades through
> dependent_modules regardless of whether the change is in source or tests.
> The cascade is correct
> > >>>>    for source changes (downstream modules import the source) but
> unnecessary for tests (no production code imports test code).
> > >>>>
> > >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]>
> wrote:
> > >>>>>
> > >>>>> For now, I created a PR to reduce the frequency by half:
> https://github.com/apache/spark/pull/55729
> > >>>>>
> > >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]>
> wrote:
> > >>>>>>
> > >>>>>> I think we need to 1) cut CI pressure and 2) look for more
> resources to run CI at the same time.
> > >>>>>>
> > >>>>>> Cut CIs:
> > >>>>>>
> > >>>>>> I think the biggest cut would be on the scheduled jobs first. For
> instance change 3.5 and 4.0 scheduled jobs from daily to once in three
> days, or even once per week.
> > >>>>>> Then for branch 4.x or more active release branches we can do
> daily post merge CI, instead of after each commit?
> > >>>>>> Meanwhile we can explore ways to run selected tests on the actual
> affected code path to avoid full runs.
> > >>>>>> And optimize tests themselves so they run faster.
> > >>>>>>
> > >>>>>> Expand resources:
> > >>>>>>
> > >>>>>> We can probably move some of the scheduled jobs out to another
> repo like what Apache Arrow did.
> > >>>>>> I wonder if self hosted runners are acceptable to the community?
> This sounds like a longer term solution if we were to introduce more checks
> in the future.
> > >>>>>>
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>> Yicong Huang
> > >>>>>>
> > >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]>
> wrote:
> > >>>>>>>
> > >>>>>>> We should probably reduce the scheduled build for the time being.
> > >>>>>>>
> > >>>>>>> As a reference, I worked in Apache Arrow, and they use an extra
> CI run by a third party, e.g., see
> > >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
> > >>>>>>> - You comment like
> https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
> > >>>>>>> - It posts the CI link like
> https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
> > >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
> > >>>>>>>
> > >>>>>>> I feel like this can be an alternative if any vendor is willing
> to support it.
> > >>>>>>>
> > >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <
> [email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> I did some quick calculations, and we can't afford the CI with
> our existing infra.
> > >>>>>>>>
> > >>>>>>>> Per ASF policy (
> https://infra.apache.org/github-actions-policy.html), the maximum weekly
> runner minutes we have is 250k. That's 1m per month, and last month, we hit
> almost the exact number - 1,082,721 minutes.
> > >>>>>>>>
> > >>>>>>>> Our current CI consists of a few components (all numbers are
> per month):
> > >>>>>>>> * each commit on the master branch - ~280k
> > >>>>>>>> * 4.1 scheduled run - ~200k
> > >>>>>>>> * 4.0 scheduled run - ~200k
> > >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
> > >>>>>>>> * master scheduled run ~ 300k
> > >>>>>>>>
> > >>>>>>>> With the new release cadence, even if we only do scheduled run
> on 4.x (which we shouldn't because it's an active dev branch but that's
> another story), we need an extra 200k. With a 6-month maintenance window,
> we will always have at least 3 active maintained versions (including LTS)
> that require CI.
> > >>>>>>>>
> > >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really
> believe we need tests for the 4.x branch - we should treat that branch more
> like master, than say 4.2. Even if we don't do pre-merge check on it, we
> should do post-merge check for every commit. Daily check on an active dev
> branch sounds a bit too risky to me. That would be another 300k.
> > >>>>>>>>
> > >>>>>>>> This does not include the discussion about any pre-merge check
> for 4.x, which we should actually think about in the future.
> > >>>>>>>>
> > >>>>>>>> So the question is - how do we deal with that? The solutions I
> can think of are
> > >>>>>>>> * Get some self-hosted runners to increase our CI capacity, which is
> limited by ASF policy
> > >>>>>>>> * Optimize our CIs and tests so it takes less time to run
> > >>>>>>>> * Reduce the coverage of our tests so we can at least test all
> branches
> > >>>>>>>>
> > >>>>>>>> Any idea is welcome.
> > >>>>>>>>
> > >>>>>>>> Tian
>
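
The monthly figures in the first message of this thread (quoted at the
bottom) can be tallied in the same way. This is a sketch under stated
assumptions: the component numbers are the approximate ones from that
message, the "extra 200k" and "another 300k" are read as additive, and a
month is taken as 52/12 weeks. All names are illustrative.

```python
# ASF limit from the quoted policy, converted to an average month.
WEEKLY_LIMIT = 250_000
monthly_limit = WEEKLY_LIMIT * 52 / 12

# Approximate monthly components listed in the message (3.5 is negligible).
components = {
    "master post-merge": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "master scheduled": 300_000,
}
listed_total = sum(components.values())

# Total usage reported for the previous month.
reported_total = 1_082_721

# Extra load named in the message for the 4.x branch:
# scheduled runs (~200k) plus post-merge checks on every commit (~300k).
extra_for_4x = 200_000 + 300_000
projected_total = reported_total + extra_for_4x

print(f"monthly limit:   {monthly_limit:,.0f}")
print(f"listed total:    {listed_total:,}")
print(f"reported total:  {reported_total:,}")
print(f"projected total: {projected_total:,} "
      f"({projected_total / monthly_limit:.0%} of the limit)")
```

The listed components add up to less than the reported total, so the gap
presumably covers workloads the message does not break out.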
