Re: [DISCUSS] How can we afford CI for the new release cadence?

Akira Ajisaka Sun, 10 May 2026 23:22:44 -0700

> Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, 
> the timing is complicated by the Ubuntu PPA DDoS outages.


I'm working on fixing branch-3.5 CI:
https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
this week. The Ubuntu outage seems unrelated.

Anyway, I'm +1 to reduce the frequency on non-active branches.

Thanks,
Akira

On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
>
> Yeah, I'm not surprised that 3.5 is not in its best shape at this point
> because we have barely been running tests on it. When we reduce the coverage
> for a branch, we will have issues when we try to release. That's why reducing
> coverage should not be our only effort. We should explore all the different
> ways to make CI better.
>
> On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
>>
>> Smarter test selection is probably the magic but it’s going to be effort. 
>> Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, 
>> the timing is complicated by the Ubuntu PPA DDoS outages.
>>
>> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]> 
>> wrote:
>>>
>>> I definitely agree that we can save a lot of time by optimizing the CI. But
>>> currently, Java tests take more time than Python tests. They are comparable,
>>> but Java tests are still noticeably more expensive. We should not focus only
>>> on the Python ones.
>>>
>>> In the meantime, I'll look for low-hanging fruit to make the CI smarter.
>>>
>>> Tian
>>>
>>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
>>>>
>>>> I also did some data analysis, and I think we should also revisit the CI:
>>>> 1. Deduplicate the compile. For example, the pyspark matrix executes 8
>>>>    byte-identical SBT compiles in parallel today, costing ~108m of
>>>>    redundant work per run.
>>>>    (I am working on a POC: https://github.com/apache/spark/pull/55726)
>>>> 2. Smarter test selection. 11% of the most recent 10,000 commits are
>>>>    test-only changes. Today these trigger the full pyspark matrix because
>>>>    the dependency graph in dev/sparktestsupport/modules.py cascades through
>>>>    dependent_modules regardless of whether the change is in source or
>>>>    tests. The cascade is correct for source changes (downstream modules
>>>>    import the source) but unnecessary for tests (no production code
>>>>    imports test code).
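
The test-selection idea in point 2 can be sketched roughly as follows. This is a minimal illustration, not the logic in dev/sparktestsupport/modules.py: the DEPENDENTS graph, the is_test_file heuristic, the modules_to_test helper and the example paths are all hypothetical and exist only to show the "cascade for source changes, stay local for test-only changes" rule.

```python
from collections import deque

# Hypothetical toy graph (NOT Spark's real module list):
# module -> modules that depend on it and must be retested when it changes.
DEPENDENTS = {
    "core": ["sql"],
    "sql": ["pyspark-sql"],
    "pyspark-sql": ["pyspark-ml"],
    "pyspark-ml": [],
}


def is_test_file(path):
    """Assumed convention: a tests/ directory or a test_ filename prefix."""
    filename = path.rsplit("/", 1)[-1]
    return "/tests/" in path or filename.startswith("test_")


def modules_to_test(changes):
    """Return the modules to run for an iterable of (module, path) changes.

    Every changed module is tested. Only source changes cascade to
    dependents, because no production code imports test code.
    """
    selected = set()
    to_expand = deque()
    for module, path in changes:
        selected.add(module)
        if not is_test_file(path):
            to_expand.append(module)

    expanded = set()
    while to_expand:
        module = to_expand.popleft()
        if module in expanded:
            continue
        expanded.add(module)
        for dependent in DEPENDENTS.get(module, []):
            selected.add(dependent)
            to_expand.append(dependent)
    return selected
```

Under these assumptions, a change to python/pyspark/sql/tests/test_functions.py selects only "pyspark-sql", while a change to python/pyspark/sql/functions.py also pulls in "pyspark-ml".
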
>>>>
>>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
>>>>>
>>>>> For now, I created a PR to reduce the frequency by half: 
>>>>> https://github.com/apache/spark/pull/55729
>>>>>
>>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]> wrote:
>>>>>>
>>>>>> I think we need to 1) cut CI pressure and 2) look for more resources to
>>>>>> run CI, both at the same time.
>>>>>>
>>>>>> Cut CI:
>>>>>>
>>>>>> I think the biggest cut would be on the scheduled jobs first. For
>>>>>> instance, change the 3.5 and 4.0 scheduled jobs from daily to once every
>>>>>> three days, or even once per week.
>>>>>> Then for branch 4.x or other more active release branches we can do a
>>>>>> daily post-merge CI run, instead of one after each commit?
>>>>>> Meanwhile we can explore ways to run only selected tests on the code
>>>>>> paths actually affected, to avoid full runs.
>>>>>> And optimize the tests themselves so they run faster.
>>>>>>
>>>>>> Expand resources:
>>>>>>
>>>>>> We can probably move some of the scheduled jobs out to another repo, like
>>>>>> what Apache Arrow did.
>>>>>> I wonder if self-hosted runners are acceptable to the community? This
>>>>>> sounds like a longer-term solution if we were to introduce more checks
>>>>>> in the future.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Yicong Huang
>>>>>>
>>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> wrote:
>>>>>>>
>>>>>>> We should probably reduce the scheduled build for the time being.
>>>>>>>
>>>>>>> As a reference, I worked on Apache Arrow, and they use an extra CI run by
>>>>>>> a third party, e.g., see
>>>>>>> - PR: https://github.com/apache/arrow/pull/48915
>>>>>>> - You comment like 
>>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
>>>>>>> - It posts the CI link like 
>>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
>>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
>>>>>>>
>>>>>>> I feel like this can be an alternative if any vendor is willing to 
>>>>>>> support it.
>>>>>>>
>>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I did some quick calculations, and we can't afford the CI with our 
>>>>>>>> existing infra.
>>>>>>>>
>>>>>>>> Per ASF policy (https://infra.apache.org/github-actions-policy.html),
>>>>>>>> the maximum weekly runner minutes we have is 250k. That's about 1M per
>>>>>>>> month, and last month we hit almost exactly that number: 1,082,721
>>>>>>>> minutes.
>>>>>>>>
>>>>>>>> Our current CI consists of a few components (all numbers are per 
>>>>>>>> month):
>>>>>>>> * each commit on the master branch - ~280k
>>>>>>>> * 4.1 scheduled run - ~200k
>>>>>>>> * 4.0 scheduled run - ~200k
>>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
>>>>>>>> * master scheduled run - ~300k
>>>>>>>>
>>>>>>>> With the new release cadence, even if we only do scheduled run on 4.x 
>>>>>>>> (which we shouldn't because it's an active dev branch but that's 
>>>>>>>> another story), we need an extra 200k. With a 6-month maintenance 
>>>>>>>> window, we will always have at least 3 active maintained versions 
>>>>>>>> (including LTS) that require CI.
>>>>>>>>
>>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe
>>>>>>>> we need tests for the 4.x branch - we should treat that branch more
>>>>>>>> like master than, say, 4.2. Even if we don't do pre-merge checks on
>>>>>>>> it, we should do a post-merge check for every commit. A daily check on
>>>>>>>> an active dev branch sounds a bit too risky to me. That would be
>>>>>>>> another 300k.
>>>>>>>>
>>>>>>>> This does not include the discussion about any pre-merge check for 
>>>>>>>> 4.x, which we should actually think about in the future.
>>>>>>>>
>>>>>>>> So the question is - how do we deal with that? The solutions I can 
>>>>>>>> think of are
>>>>>>>> * Get some self-hosted runners to increase our CI capacity, which is
>>>>>>>> limited by ASF policy
>>>>>>>> * Optimize our CI and tests so they take less time to run
>>>>>>>> * Reduce the coverage of our tests so we can at least test all branches
>>>>>>>>
>>>>>>>> Any idea is welcome.
>>>>>>>>
>>>>>>>> Tian
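
The budget arithmetic in the original message can be reproduced with a short Python sketch. The minute figures are the ones quoted in the thread. The rest is assumed for illustration: converting the weekly cap with 52/12 weeks per month, counting the "negligible" 3.5 runs as 0, and reading "another 300k" as coming on top of the extra 200k.

```python
# Runner-minute figures quoted in the thread.
WEEKLY_CAP = 250_000
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length
MONTHLY_CAP = WEEKLY_CAP * WEEKS_PER_MONTH

# Approximate monthly usage per component, as itemized in the thread.
CURRENT = {
    "master per-commit": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "3.5 scheduled": 0,  # "negligible" in the thread; treated as 0 here
    "master scheduled": 300_000,
}
CURRENT_TOTAL = sum(CURRENT.values())

# Extra monthly minutes under each reading of the proposal (assumed labels).
SCENARIOS = {
    "today (itemized estimate)": 0,
    "+ 4.x scheduled runs": 200_000,
    "+ 4.x scheduled and per-commit runs": 200_000 + 300_000,
}


def overshoot(extra_minutes):
    """Minutes above the monthly cap; a negative value means headroom."""
    return CURRENT_TOTAL + extra_minutes - MONTHLY_CAP


print(f"monthly cap: {MONTHLY_CAP:,.0f}")
for name, extra in SCENARIOS.items():
    total = CURRENT_TOTAL + extra
    print(f"{name:38s} total={total:>9,} overshoot={overshoot(extra):>+10,.0f}")
```

With these inputs the itemized components sum to 980k against a cap of roughly 1,083k per month, so the extra 200k for 4.x scheduled runs alone is already enough to exceed the cap.
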
