Is it possible to run the 4.x branch post-merge CI as a scheduled job, possibly daily, instead of after every commit? I think this could quickly cut our CI usage.

Best,
Yicong Huang

On Fri, May 22, 2026 at 11:48 AM Tian Gao via dev <[email protected]> wrote:

> Like I mentioned a few weeks ago, we can't afford this. We received the
> warning from ASF today and took a quick look at our CI usage.
>
> We are using about 350k min/week now, and the limit is 250k min/week. The
> post merge itself took 180k+ min/week because now we have 2 active dev
> branches.
>
> I think we should put some effort into this. There are a few ways to make
> the situation better:
>
> 1. Run fewer tests - We disabled pandas on spark tests for post merge a
>    while ago to comply with the ASF limit.
> 2. Make tests run faster - I occasionally optimize python tests, not sure
>    if Java tests are being taken care of. Java tests took significantly
>    more time in our CI now.
> 3. Run tests less frequently - helpful for scheduled CI which we already
>    did, but won't help post merge.
> 4. Smart testing - this is a bit tricky for post-merge because ideally we
>    want a full coverage for each commit. We can probably do some safe
>    heuristics, but it takes time and we could potentially lose coverage.
> 5. Move scheduled tests to another repo - arrow seems to be doing this.
>    This allows us to use all the ASF budget to run post-merge tests. However,
>    we need some sponsor to achieve this.
>
> I think we have 2 weeks to at least temporarily reduce our CI usage under
> the limit, so we need something fast, then something good.
>
> Tian
>
> On Mon, May 11, 2026 at 3:14 AM Akira Ajisaka <[email protected]> wrote:
>
>> > I'm working on fixing branch-3.5 CI:
>> > https://github.com/apache/spark/pull/55764
>> > Hopefully I'll complete it this week.
>>
>> Closed the above PR as a duplicate of
>> https://github.com/apache/spark/pull/55432
>> Sorry for the confusion.
>>
>> On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]>
>> wrote:
>> >
>> > > Also on the 3.5 side the CI is super broken so I’m trying to fix it
>> > > up now, the timing is complicated by the Ubuntu PPA DDoS outages.
>> >
>> > I'm working on fixing branch-3.5 CI:
>> > https://github.com/apache/spark/pull/55764
>> > Hopefully I'll complete it this week. The Ubuntu outage seems unrelated.
>> >
>> > Anyway, I'm +1 to reduce the frequency on non-active branches.
>> >
>> > Thanks,
>> > Akira
>> >
>> > On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]>
>> > wrote:
>> > >
>> > > Yeah I'm not surprised that 3.5 is not in its best shape at this
>> > > point because we almost did not run tests on it. When we reduce the
>> > > coverage for a branch, we will have issues when we try to release. That's
>> > > why we should not only make efforts on that side. We should explore all
>> > > different ways to make CI better.
>> > >
>> > > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]>
>> > > wrote:
>> > >>
>> > >> Smarter test selection is probably the magic but it’s going to be
>> > >> effort. Also on the 3.5 side the CI is super broken so I’m trying to fix it
>> > >> up now, the timing is complicated by the Ubuntu PPA DDoS outages.
>> > >>
>> > >>
>> > >> Twitter: https://twitter.com/holdenkarau
>> > >> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> > >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > >> Pronouns: she/her
>> > >>
>> > >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]> wrote:
>> > >>>
>> > >>> I definitely agree that we can save a lot of time by optimizing the
>> > >>> CI. But currently, java tests take more time than python tests. They are
>> > >>> comparable but java tests are still observably more expensive. We should
>> > >>> not only focus on python ones.
>> > >>>
>> > >>> In the meantime, I'll take a look at low-hanging fruit on CI to
>> > >>> make it smarter.
>> > >>>
>> > >>> Tian
>> > >>>
>> > >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
>> > >>>>
>> > >>>> I also did some data analysis, and think we should also revisit the CI:
>> > >>>> 1. Deduplicate the compile. For example, the pyspark matrix
>> > >>>> executes 8 byte-identical SBT compiles in parallel today, costing ~108m of
>> > >>>> redundant work per run.
>> > >>>> (I am working on a POC: https://github.com/apache/spark/pull/55726)
>> > >>>> 2. Smarter test selection. 11% of the 10000 most recent commits are
>> > >>>> test-only changes.
>> > >>>> Today these trigger the full pyspark matrix because the dependency
>> > >>>> graph in dev/sparktestsupport/modules.py cascades through
>> > >>>> dependent_modules regardless of whether the change is in source or tests.
>> > >>>> The cascade is correct for source changes (downstream modules import the
>> > >>>> source) but unnecessary for tests (no production code imports test code).
>> > >>>>
>> > >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]>
>> > >>>> wrote:
>> > >>>>>
>> > >>>>> For now, I created a PR to reduce the frequency by half:
>> > >>>>> https://github.com/apache/spark/pull/55729
>> > >>>>>
>> > >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]>
>> > >>>>> wrote:
>> > >>>>>>
>> > >>>>>> I think we need to 1) cut CIs pressure and 2) look for more
>> > >>>>>> resources to run CIs at the same time.
>> > >>>>>>
>> > >>>>>> Cut CIs:
>> > >>>>>>
>> > >>>>>> I think the biggest cut would be on the scheduled jobs first.
>> > >>>>>> For instance change 3.5 and 4.0 scheduled jobs from daily to once in three
>> > >>>>>> days, or even once per week.
>> > >>>>>> Then for branch 4.x or more active release branches we can do
>> > >>>>>> daily post merge CI, instead of after each commit?
>> > >>>>>> Meanwhile we can explore ways to run selected tests on the
>> > >>>>>> actual affected code path to avoid full runs.
>> > >>>>>> And optimize tests themselves so they run faster.
>> > >>>>>>
>> > >>>>>> Expand resources:
>> > >>>>>>
>> > >>>>>> We can probably move some of the scheduled jobs out to another
>> > >>>>>> repo like what Apache Arrow did.
>> > >>>>>> I wonder if self hosted runners are acceptable to the community?
>> > >>>>>> This sounds like a longer term solution if we were to introduce more checks
>> > >>>>>> in the future.
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Best regards,
>> > >>>>>> Yicong Huang
>> > >>>>>>
>> > >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> wrote:
>> > >>>>>>>
>> > >>>>>>> We should probably reduce the scheduled build for the time being.
>> > >>>>>>>
>> > >>>>>>> As a reference, I worked in Apache Arrow, and they use an extra
>> > >>>>>>> CI by thirdparty, e.g., see
>> > >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
>> > >>>>>>> - You comment like
>> > >>>>>>>   https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
>> > >>>>>>> - It posts the CI link like
>> > >>>>>>>   https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
>> > >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
>> > >>>>>>>
>> > >>>>>>> I feel like this can be an alternative if any vendor is willing
>> > >>>>>>> to support it.
>> > >>>>>>>
>> > >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]> wrote:
>> > >>>>>>>>
>> > >>>>>>>> I did some quick calculations, and we can't afford the CI with
>> > >>>>>>>> our existing infra.
>> > >>>>>>>>
>> > >>>>>>>> Per ASF policy (https://infra.apache.org/github-actions-policy.html),
>> > >>>>>>>> the maximum weekly runner minutes we have is 250k.
>> > >>>>>>>> That's 1m per month, and last month, we hit almost the exact
>> > >>>>>>>> number - 1,082,721 minutes.
>> > >>>>>>>>
>> > >>>>>>>> Our current CI consists of a few components (all numbers are
>> > >>>>>>>> per month):
>> > >>>>>>>> * each commit on master branch - ~280k
>> > >>>>>>>> * 4.1 scheduled run - ~200k
>> > >>>>>>>> * 4.0 scheduled run - ~200k
>> > >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
>> > >>>>>>>> * master scheduled run - ~300k
>> > >>>>>>>>
>> > >>>>>>>> With the new release cadence, even if we only do scheduled run
>> > >>>>>>>> on 4.x (which we shouldn't because it's an active dev branch but that's
>> > >>>>>>>> another story), we need an extra 200k. With a 6-month maintenance window,
>> > >>>>>>>> we will always have at least 3 active maintained versions (including LTS)
>> > >>>>>>>> that require CI.
>> > >>>>>>>>
>> > >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really
>> > >>>>>>>> believe we need tests for the 4.x branch - we should treat that branch more
>> > >>>>>>>> like master, than say 4.2. Even if we don't do pre-merge check on it, we
>> > >>>>>>>> should do post-merge check for every commit. Daily check on an active dev
>> > >>>>>>>> branch sounds a bit too risky to me. That would be another 300k.
>> > >>>>>>>>
>> > >>>>>>>> This does not include the discussion about any pre-merge check
>> > >>>>>>>> for 4.x, which we should actually think about in the future.
>> > >>>>>>>>
>> > >>>>>>>> So the question is - how do we deal with that? The solutions I
>> > >>>>>>>> can think of are
>> > >>>>>>>> * Get some self-hosted runners and increase our CI capability
>> > >>>>>>>>   limited by ASF policy
>> > >>>>>>>> * Optimize our CIs and tests so it takes less time to run
>> > >>>>>>>> * Reduce the coverage of our tests so we can at least test all
>> > >>>>>>>>   branches
>> > >>>>>>>>
>> > >>>>>>>> Any idea is welcome.
>> > >>>>>>>>
>> > >>>>>>>> Tian
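
For reference, the budget arithmetic quoted in this thread can be sketched in a few lines of Python. This is only a rough sketch: the figures are the rounded ones from the messages above, the 3.5 scheduled run is counted as 0 because it was described as negligible, and the constant and function names are made up for illustration.

```python
# Rough CI budget arithmetic using only figures quoted in this thread.
# All names are illustrative; the numbers are the rounded values from the emails.

WEEKLY_LIMIT_MIN = 250_000        # ASF limit, runner minutes per week
WEEKLY_USAGE_MIN = 350_000        # reported usage ("about 350k min/week")
WEEKLY_POST_MERGE_MIN = 180_000   # post-merge share ("180k+"), a lower bound
MONTHLY_LIMIT_MIN = 1_000_000     # the thread's "1m per month" approximation

# Monthly components from the earlier estimate; 3.5 is "negligible", counted as 0.
MONTHLY_COMPONENTS_MIN = {
    "master per-commit": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "3.5 scheduled": 0,
    "master scheduled": 300_000,
}


def weekly_overage(usage: int, limit: int) -> int:
    """Minutes per week above the limit (0 if under the limit)."""
    return max(0, usage - limit)


def required_cut_fraction(usage: int, limit: int) -> float:
    """Fraction of current usage that must be removed to reach the limit."""
    return weekly_overage(usage, limit) / usage


def monthly_total(components: dict, extra: int = 0) -> int:
    """Sum of monthly components plus an optional extra workload."""
    return sum(components.values()) + extra


if __name__ == "__main__":
    print("weekly overage:", weekly_overage(WEEKLY_USAGE_MIN, WEEKLY_LIMIT_MIN))
    print("required cut:", round(required_cut_fraction(WEEKLY_USAGE_MIN, WEEKLY_LIMIT_MIN), 3))
    print("post-merge share:", round(WEEKLY_POST_MERGE_MIN / WEEKLY_USAGE_MIN, 3))
    print("monthly baseline:", monthly_total(MONTHLY_COMPONENTS_MIN))
    print("plus 4.x scheduled:", monthly_total(MONTHLY_COMPONENTS_MIN, 200_000))
    print("plus 4.x per-commit:", monthly_total(MONTHLY_COMPONENTS_MIN, 300_000))
```

With these rounded inputs, the weekly overage is 100k minutes, so roughly 29% of current usage would have to go, and the listed monthly components alone already add up to 980k against a budget of about 1m.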

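
The "smarter test selection" idea from the thread (cascade to downstream modules only when source files change, not when only tests change) could look roughly like the sketch below. It is not the code in dev/sparktestsupport/modules.py: the `Module` class, its fields, the module names, and the path prefixes are all hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for one node of a module dependency graph."""
    name: str
    source_prefixes: tuple        # path prefixes of production code
    test_prefixes: tuple          # path prefixes of test code
    dependents: list = field(default_factory=list)  # names of downstream modules


def modules_to_test(changed_files, modules):
    """Select modules to test; cascade to dependents only for source changes."""
    by_name = {m.name: m for m in modules}
    selected = set()
    source_changed = []
    for path in changed_files:
        for m in modules:
            # Check test prefixes first: test dirs may live under a source prefix.
            if path.startswith(m.test_prefixes):
                selected.add(m.name)            # test-only change: no cascade
            elif path.startswith(m.source_prefixes):
                source_changed.append(m.name)   # source change: cascade below
    # Files matching no module are ignored in this sketch; a real implementation
    # would probably want a conservative fallback such as running everything.
    visited = set()
    stack = list(source_changed)
    while stack:
        name = stack.pop()
        if name in visited:
            continue
        visited.add(name)
        selected.add(name)
        stack.extend(by_name[name].dependents)
    return selected


# Hypothetical three-module chain: core -> sql -> pyspark-sql.
EXAMPLE_MODULES = [
    Module("core", ("core/src/main/",), ("core/src/test/",), ["sql"]),
    Module("sql", ("sql/src/main/",), ("sql/src/test/",), ["pyspark-sql"]),
    Module("pyspark-sql", ("python/pyspark/sql/",), ("python/pyspark/sql/tests/",)),
]

if __name__ == "__main__":
    print(sorted(modules_to_test(["core/src/test/FooSuite.scala"], EXAMPLE_MODULES)))
    print(sorted(modules_to_test(["core/src/main/Foo.scala"], EXAMPLE_MODULES)))
```

In this toy graph a test-only change in core selects just core, while a source change in core still selects core, sql, and pyspark-sql, which is the behavior the thread describes as correct for source changes.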