> Also on the 3.5 side the CI is super broken so I’m trying to fix it up now,
> the timing is complicated by the Ubuntu PPA DDoS outages.
I'm working on fixing branch-3.5 CI: https://github.com/apache/spark/pull/55764.
Hopefully I'll complete it this week. The Ubuntu outage seems unrelated.

Anyway, I'm +1 to reduce the frequency on non-active branches.

Thanks,
Akira

On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
>
> Yeah I'm not surprised that 3.5 is not in its best shape at this point
> because we almost did not run tests on it. When we reduce the coverage for a
> branch, we will have issues when we try to release. That's why we should not
> only make efforts on that side. We should explore all different ways to make
> CI better.
>
> On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
>>
>> Smarter test selection is probably the magic but it’s going to be effort.
>> Also on the 3.5 side the CI is super broken so I’m trying to fix it up now,
>> the timing is complicated by the Ubuntu PPA DDoS outages.
>>
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]>
>> wrote:
>>>
>>> I definitely agree that we can save a lot of time by optimizing the CI. But
>>> currently, java tests take more time than python tests. They are comparable
>>> but java tests are still observably more expensive. We should not only
>>> focus on python ones.
>>>
>>> In the meantime, I'll take a look at low-hanging fruit in CI to make it
>>> smarter.
>>>
>>> Tian
>>>
>>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
>>>>
>>>> I also did some data analysis, and think we should also revisit the CI:
>>>> 1, Deduplicate the compile.
>>>> For example, the pyspark matrix executes 8 byte-identical SBT compiles in
>>>> parallel today, costing ~108m of redundant work per run.
>>>> (I am working on a POC: https://github.com/apache/spark/pull/55726)
>>>> 2, Smarter test selection. 11% of recent 10000 commits are test-only
>>>> changes. Today these trigger the full pyspark matrix because the dependency
>>>> graph in dev/sparktestsupport/modules.py cascades through
>>>> dependent_modules regardless of whether the change is in source or tests.
>>>> The cascade is correct for source changes (downstream modules import the
>>>> source) but unnecessary for tests (no production code imports test code).
>>>>
>>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
>>>>>
>>>>> For now, I created a PR to reduce the frequency by half:
>>>>> https://github.com/apache/spark/pull/55729
>>>>>
>>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]> wrote:
>>>>>>
>>>>>> I think we need to 1) cut CIs pressure and 2) look for more resources to
>>>>>> run CIs at the same time.
>>>>>>
>>>>>> Cut CIs:
>>>>>>
>>>>>> I think the biggest cut would be on the scheduled jobs first. For
>>>>>> instance change 3.5 and 4.0 scheduled jobs from daily to once in three
>>>>>> days, or even once per week.
>>>>>> Then for branch 4.x or more active release branches we can do daily post
>>>>>> merge CI, instead of after each commit?
>>>>>> Meanwhile we can explore ways to run selected tests on the actual
>>>>>> affected code path to avoid full runs.
>>>>>> And optimize tests themselves so they run faster.
>>>>>>
>>>>>> Expand resources:
>>>>>>
>>>>>> We can probably move some of the scheduled jobs out to another repo like
>>>>>> what Apache Arrow did.
>>>>>> I wonder if self hosted runners are acceptable to the community? This
>>>>>> sounds like a longer term solution if we were to introduce more checks
>>>>>> in the future.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Yicong Huang
>>>>>>
>>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> wrote:
>>>>>>>
>>>>>>> We should probably reduce the scheduled build for the time being.
>>>>>>>
>>>>>>> As a reference, I worked in Apache Arrow, and they use an extra CI by
>>>>>>> a third party, e.g., see
>>>>>>> - PR: https://github.com/apache/arrow/pull/48915
>>>>>>> - You comment like
>>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
>>>>>>> - It posts the CI link like
>>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
>>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
>>>>>>>
>>>>>>> I feel like this can be an alternative if any vendor is willing to
>>>>>>> support it.
>>>>>>>
>>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I did some quick calculations, and we can't afford the CI with our
>>>>>>>> existing infra.
>>>>>>>>
>>>>>>>> Per ASF policy (https://infra.apache.org/github-actions-policy.html),
>>>>>>>> the maximum weekly runner minutes we have is 250k. That's 1m per
>>>>>>>> month, and last month, we hit almost the exact number - 1,082,721
>>>>>>>> minutes.
>>>>>>>>
>>>>>>>> Our current CI consists of a few components (all numbers are per
>>>>>>>> month):
>>>>>>>> * each commit on master branch - ~280k
>>>>>>>> * 4.1 scheduled run - ~200k
>>>>>>>> * 4.0 scheduled run - ~200k
>>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
>>>>>>>> * master scheduled run - ~300k
>>>>>>>>
>>>>>>>> With the new release cadence, even if we only do scheduled run on 4.x
>>>>>>>> (which we shouldn't because it's an active dev branch but that's
>>>>>>>> another story), we need an extra 200k. With a 6-month maintenance
>>>>>>>> window, we will always have at least 3 active maintained versions
>>>>>>>> (including LTS) that require CI.
>>>>>>>>
>>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe
>>>>>>>> we need tests for the 4.x branch - we should treat that branch more
>>>>>>>> like master, than say 4.2. Even if we don't do pre-merge check on it,
>>>>>>>> we should do post-merge check for every commit. Daily check on an
>>>>>>>> active dev branch sounds a bit too risky to me. That would be another
>>>>>>>> 300k.
>>>>>>>>
>>>>>>>> This does not include the discussion about any pre-merge check for
>>>>>>>> 4.x, which we should actually think about in the future.
>>>>>>>>
>>>>>>>> So the question is - how do we deal with that? The solutions I can
>>>>>>>> think of are
>>>>>>>> * Get some self-hosted runners and increase our CI capability limited by
>>>>>>>> ASF policy
>>>>>>>> * Optimize our CIs and tests so it takes less time to run
>>>>>>>> * Reduce the coverage of our tests so we can at least test all branches
>>>>>>>>
>>>>>>>> Any idea is welcome.
>>>>>>>>
>>>>>>>> Tian

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
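
The budget arithmetic in Tian's first message can be checked with a short script. This is only a sketch, not project tooling. The figures are the approximate per-month numbers quoted in the thread. Three things are assumptions: the 3.5 scheduled run is counted as 0 because the thread calls it negligible, weeks are converted to months as 52/12, and "another 300k" is read as adding to the extra 200k rather than replacing it.

```python
# Approximate GitHub Actions runner minutes per month, as quoted in the thread.
CURRENT = {
    "master, every commit": 280_000,
    "master, scheduled": 300_000,
    "4.1, scheduled": 200_000,
    "4.0, scheduled": 200_000,
    "3.5, scheduled": 0,  # "negligible" in the thread; 0 is an assumption
}

# Extra load discussed for the 4.x branch (assumed to be additive).
PROPOSED_4X = {
    "4.x, scheduled": 200_000,
    "4.x, post-merge on every commit": 300_000,
}

WEEKLY_CAP = 250_000             # ASF policy limit cited in the thread
OBSERVED_LAST_MONTH = 1_082_721  # usage reported in the thread


def monthly_cap(weekly_cap=WEEKLY_CAP):
    """Convert the weekly cap to a monthly one (assumes 52 weeks / 12 months)."""
    return weekly_cap * 52 / 12


def total(*groups):
    """Sum the minutes across one or more {component: minutes} dicts."""
    return sum(sum(group.values()) for group in groups)


if __name__ == "__main__":
    cap = monthly_cap()
    print(f"monthly cap:           {cap:>12,.0f}")
    print(f"observed last month:   {OBSERVED_LAST_MONTH:>12,}")
    print(f"itemized today:        {total(CURRENT):>12,}")
    print(f"itemized + 4.x extras: {total(CURRENT, PROPOSED_4X):>12,}")
    print(f"over cap by:           {total(CURRENT, PROPOSED_4X) - cap:>12,.0f}")
```

Under these assumptions the monthly cap works out to about 1,083,333 minutes, which is why last month's 1,082,721 is described as hitting almost the exact number. The itemized components sum to 980,000. The thread does not break down the remaining usage. Adding both 4.x items brings the itemized total to 1,480,000, roughly 400k over the cap.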

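Ruifeng's second point (cascade through dependents for source changes, but not for test-only changes) can be sketched as below. This is an illustration under stated assumptions, not the code in dev/sparktestsupport/modules.py. The `Module` class, its fields, the path prefixes, and the three-module chain are all hypothetical. Running everything when a file matches no module is a conservative choice made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a test module entry (not the real modules.py class)."""
    name: str
    source_prefixes: tuple
    test_prefixes: tuple
    dependents: list = field(default_factory=list)  # modules that import this one


def with_transitive_dependents(module):
    """Return the names of `module` and of everything downstream of it."""
    names, stack = set(), [module]
    while stack:
        current = stack.pop()
        if current.name not in names:
            names.add(current.name)
            stack.extend(current.dependents)
    return names


def modules_to_test(changed_files, modules):
    """Select module names to test for a list of changed file paths."""
    selected = set()
    for path in changed_files:
        owner = next(
            (m for m in modules
             if path.startswith(m.test_prefixes + m.source_prefixes)),
            None,
        )
        if owner is None:
            # Unrecognized file: stay conservative and run everything.
            return {m.name for m in modules}
        if path.startswith(owner.test_prefixes):
            selected.add(owner.name)  # test-only change: no cascade
        else:
            selected |= with_transitive_dependents(owner)  # source change: cascade
    return selected


# Hypothetical chain: sql imports core, ml imports sql.
ml = Module("ml", ("ml/src/",), ("ml/tests/",))
sql = Module("sql", ("sql/src/",), ("sql/tests/",), dependents=[ml])
core = Module("core", ("core/src/",), ("core/tests/",), dependents=[sql])
ALL = [core, sql, ml]

if __name__ == "__main__":
    print(sorted(modules_to_test(["core/tests/test_rdd.py"], ALL)))  # ['core']
    print(sorted(modules_to_test(["core/src/rdd.py"], ALL)))  # ['core', 'ml', 'sql']
```

In this toy graph a change to a core test selects only core, while a change to core source still selects all three modules.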