Like I mentioned a few weeks ago, we can't afford this. We received the warning from ASF today and took a quick look at our CI usage.
We are using about 350k min/week now, and the limit is 250k min/week. Post-merge runs alone took 180k+ min/week because we now have 2 active dev branches. I think we should put some effort into this. There are a few ways to make the situation better:

1. Run fewer tests - We disabled pandas-on-Spark tests for post merge a while ago to comply with the ASF limit.
2. Make tests run faster - I occasionally optimize Python tests, but I'm not sure if Java tests are being taken care of. Java tests take significantly more time in our CI now.
3. Run tests less frequently - helpful for scheduled CI, which we already did, but it won't help post merge.
4. Smart testing - this is a bit tricky for post-merge because ideally we want full coverage for each commit. We can probably do some safe heuristics, but it takes time and we could potentially lose coverage.
5. Move scheduled tests to another repo - Arrow seems to be doing this. This allows us to use all the ASF budget to run post-merge tests. However, we need a sponsor to achieve this.

I think we have 2 weeks to at least temporarily reduce our CI usage under the limit, so we need something fast, then something good.

Tian

On Mon, May 11, 2026 at 3:14 AM Akira Ajisaka <[email protected]> wrote:
>
> I'm working on fixing branch-3.5 CI: https://github.com/apache/spark/pull/55764. Hopefully I'll complete it this week.
>
> Closed the above PR as a duplicate of https://github.com/apache/spark/pull/55432. Sorry for the confusion.
>
> On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]> wrote:
> >
> > > Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, the timing is complicated by the Ubuntu PPA DDoS outages.
> >
> > I'm working on fixing branch-3.5 CI: https://github.com/apache/spark/pull/55764. Hopefully I'll complete it this week. The Ubuntu outage seems unrelated.
> >
> > Anyway, I'm +1 to reduce the frequency on non-active branches.
> >
> > Thanks,
> > Akira
> >
> > On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
> > >
> > > Yeah I'm not surprised that 3.5 is not in its best shape at this point because we almost did not run tests on it. When we reduce the coverage for a branch, we will have issues when we try to release. That's why we should not only make efforts on that side. We should explore all different ways to make CI better.
> > >
> > > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
> > >>
> > >> Smarter test selection is probably the magic but it’s going to be effort. Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, the timing is complicated by the Ubuntu PPA DDoS outages.
> > >>
> > >>
> > >> Twitter: https://twitter.com/holdenkarau
> > >> Fight Health Insurance: https://www.fighthealthinsurance.com/
> > >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> > >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> > >> Pronouns: she/her
> > >>
> > >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]> wrote:
> > >>>
> > >>> I definitely agree that we can save a lot of time by optimizing the CI. But currently, Java tests take more time than Python tests. They are comparable but Java tests are still observably more expensive. We should not only focus on Python ones.
> > >>>
> > >>> In the meantime, I'll take a look at low-hanging fruit in CI to make it smarter.
> > >>>
> > >>> Tian
> > >>>
> > >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
> > >>>>
> > >>>> I also did some data analysis, and think we should also revisit the CI:
> > >>>> 1, Deduplicate the compile. For example, the pyspark matrix executes 8 byte-identical SBT compiles in parallel today, costing ~108 min of redundant work per run.
> > >>>> (I am working on a POC: https://github.com/apache/spark/pull/55726)
> > >>>> 2, Smarter test selection. 11% of the most recent 10,000 commits are test-only changes. Today these trigger the full pyspark matrix because the dependency graph in dev/sparktestsupport/modules.py cascades through dependent_modules regardless of whether the change is in source or tests. The cascade is correct for source changes (downstream modules import the source) but unnecessary for tests (no production code imports test code).
> > >>>>
> > >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
> > >>>>>
> > >>>>> For now, I created a PR to reduce the frequency by half: https://github.com/apache/spark/pull/55729
> > >>>>>
> > >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]> wrote:
> > >>>>>>
> > >>>>>> I think we need to 1) cut CI pressure and 2) look for more resources to run CI at the same time.
> > >>>>>>
> > >>>>>> Cut CIs:
> > >>>>>>
> > >>>>>> I think the biggest cut would be on the scheduled jobs first. For instance, change 3.5 and 4.0 scheduled jobs from daily to once every three days, or even once per week.
> > >>>>>> Then for branch 4.x or more active release branches we can do daily post merge CI, instead of after each commit?
> > >>>>>> Meanwhile we can explore ways to run selected tests on the actual affected code path to avoid full runs.
> > >>>>>> And optimize tests themselves so they run faster.
> > >>>>>>
> > >>>>>> Expand resources:
> > >>>>>>
> > >>>>>> We can probably move some of the scheduled jobs out to another repo like what Apache Arrow did.
> > >>>>>> I wonder if self-hosted runners are acceptable to the community? This sounds like a longer term solution if we were to introduce more checks in the future.
> > >>>>>>
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>> Yicong Huang
> > >>>>>>
> > >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> We should probably reduce the scheduled build for the time being.
> > >>>>>>>
> > >>>>>>> As a reference, I worked in Apache Arrow, and they use an extra CI run by a third party, e.g., see
> > >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
> > >>>>>>> - You comment like https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
> > >>>>>>> - It posts the CI link like https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
> > >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
> > >>>>>>>
> > >>>>>>> I feel like this can be an alternative if any vendor is willing to support it.
> > >>>>>>>
> > >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> I did some quick calculations, and we can't afford the CI with our existing infra.
> > >>>>>>>>
> > >>>>>>>> Per ASF policy (https://infra.apache.org/github-actions-policy.html), the maximum weekly runner minutes we have is 250k. That's about 1M per month, and last month, we hit almost the exact number - 1,082,721 minutes.
> > >>>>>>>>
> > >>>>>>>> Our current CI consists of a few components (all numbers are per month):
> > >>>>>>>> * each commit on master branch - ~280k
> > >>>>>>>> * 4.1 scheduled run - ~200k
> > >>>>>>>> * 4.0 scheduled run - ~200k
> > >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
> > >>>>>>>> * master scheduled run - ~300k
> > >>>>>>>>
> > >>>>>>>> With the new release cadence, even if we only do scheduled run on 4.x (which we shouldn't because it's an active dev branch but that's another story), we need an extra 200k. With a 6-month maintenance window, we will always have at least 3 active maintained versions (including LTS) that require CI.
> > >>>>>>>>
> > >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe we need tests for the 4.x branch - we should treat that branch more like master than, say, 4.2. Even if we don't do pre-merge check on it, we should do post-merge check for every commit. Daily check on an active dev branch sounds a bit too risky to me. That would be another 300k.
> > >>>>>>>>
> > >>>>>>>> This does not include the discussion about any pre-merge check for 4.x, which we should actually think about in the future.
> > >>>>>>>>
> > >>>>>>>> So the question is - how do we deal with that? The solutions I can think of are
> > >>>>>>>> * Get some self-hosted runners and increase our CI capability limited by ASF policy
> > >>>>>>>> * Optimize our CIs and tests so they take less time to run
> > >>>>>>>> * Reduce the coverage of our tests so we can at least test all branches
> > >>>>>>>>
> > >>>>>>>> Any idea is welcome.
> > >>>>>>>>
> > >>>>>>>> Tian
>
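
As a rough sanity check on the budget numbers quoted in this thread, the arithmetic can be sketched in a few lines of Python. The figures come straight from the messages above; the constant and function names are illustrative only, and converting the weekly limit to a monthly one with 52/12 weeks per month is an assumption, not something the ASF policy states.

```python
# Figures quoted in this thread (runner minutes). Names are illustrative.
WEEKLY_LIMIT_MIN = 250_000          # ASF limit per week
CURRENT_WEEKLY_USAGE_MIN = 350_000  # "about 350k min/week now"
LAST_MONTH_USAGE_MIN = 1_082_721    # last month's usage from the original message

# Approximate monthly breakdown from the original message.
MONTHLY_COMPONENTS_MIN = {
    "master post-merge (each commit)": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "3.5 scheduled": 0,  # described as negligible, so counted as zero here
    "master scheduled": 300_000,
}

# Assumption: an average month is 52/12 weeks long.
WEEKS_PER_MONTH = 52 / 12


def monthly_limit(weekly_limit=WEEKLY_LIMIT_MIN):
    """Weekly limit scaled to an average month."""
    return weekly_limit * WEEKS_PER_MONTH


def required_weekly_cut(usage=CURRENT_WEEKLY_USAGE_MIN, limit=WEEKLY_LIMIT_MIN):
    """Minutes per week that must be removed to get under the limit."""
    return max(0, usage - limit)


def cut_fraction(usage=CURRENT_WEEKLY_USAGE_MIN, limit=WEEKLY_LIMIT_MIN):
    """Required cut as a fraction of current usage."""
    return required_weekly_cut(usage, limit) / usage


if __name__ == "__main__":
    print(f"monthly limit:      {monthly_limit():,.0f} min")
    print(f"listed components:  {sum(MONTHLY_COMPONENTS_MIN.values()):,} min")
    print(f"last month's usage: {LAST_MONTH_USAGE_MIN:,} min")
    print(f"weekly cut needed:  {required_weekly_cut():,} min "
          f"({cut_fraction():.0%} of current usage)")
```

Under these assumptions the weekly gap is 100k minutes, a bit under 30% of current usage. The listed components add up to 980k per month, so they only roughly account for last month's total.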

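
The test-only heuristic raised in this thread (cascade through dependents for source changes, but not for test-only changes) could look roughly like the sketch below. This is a toy model, not the actual code in dev/sparktestsupport/modules.py: the module graph, the path layout, and the helper names (`DEPENDENTS`, `owning_module`, `is_test_file`, `modules_to_test`) are all hypothetical.

```python
from collections import deque

# Hypothetical toy graph: module -> modules that depend on it.
DEPENDENTS = {
    "core": ["sql", "pyspark-core"],
    "sql": ["pyspark-sql"],
    "pyspark-core": ["pyspark-sql"],
    "pyspark-sql": [],
}


def owning_module(path):
    """Assumed layout: the first path component names the module."""
    return path.split("/", 1)[0]


def is_test_file(path):
    """Assumed convention: test code lives under a 'tests' or 'test' directory."""
    parts = path.split("/")
    return "tests" in parts or "test" in parts


def modules_to_test(changed_files):
    """Select modules to test, cascading only from source changes."""
    source_changed = set()
    test_only = set()
    for path in changed_files:
        module = owning_module(path)
        if is_test_file(path):
            test_only.add(module)
        else:
            source_changed.add(module)

    # Breadth-first cascade through dependents, starting only from modules
    # whose source changed. Test-only modules are added without cascading.
    visited = set(source_changed)
    queue = deque(source_changed)
    while queue:
        module = queue.popleft()
        for dependent in DEPENDENTS.get(module, []):
            if dependent not in visited:
                visited.add(dependent)
                queue.append(dependent)
    return visited | test_only
```

With this toy graph, a change to `sql/tests/test_x.py` selects only `sql`, while a change to `sql/src/x.py` selects `sql` and `pyspark-sql`. Whether a real path can be classified as test-only this simply is exactly the kind of coverage risk mentioned above.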