Hi Manu, all of these were handled in the parent PR I mentioned three weeks ago. Can we all please review this? https://github.com/apache/iceberg/pull/16566
I can split into smaller PRs if required.

On Thu, Jun 18, 2026 at 1:59 PM Manu Zhang <[email protected]> wrote:

> Hi all,
>
> Here's another quick win from scoping Spark CI to only changed Spark
> versions [1]. We usually open a PR first against the latest Spark version
> and then back-port it to previous versions after the merge. Running Spark
> CI for all Spark versions in such cases wastes resources.
>
> If this approach is approved, I can also make a PR for Flink CI.
>
>
> 1. https://github.com/apache/iceberg/pull/16800
>
> Thanks,
> Manu
>
> On Sat, Jun 13, 2026 at 8:34 AM Abnob Doss <[email protected]> wrote:
>
>> Hi,
>>
>> A potential small win from the subproject side: the iceberg-rust Python
>> bindings CI had ended up building the Rust bindings twice per run, due to
>> an accidental interaction between a few changes over time. One-line fix:
>> https://github.com/apache/iceberg-rust/pull/2636
>>
>> Measured over the past 7 days, the duplicate build took a median of 8.4
>> min on Linux, 12.1 min on macOS, and 15.3 min on Windows, totaling about
>> 2,400 runner-minutes across 207 job executions. After the fix the same step
>> takes a few seconds.
>>
>> Thanks,
>> Abanoub
>>
>> On Wednesday, June 3rd, 2026 at 9:49 AM, Bob Thomson <[email protected]>
>> wrote:
>>
>> > I don't think we have data to that level of granularity, it's a case of
>> > looking at the Actions and their run time and frequency of execution in
>> > each of your repos, and focussing on the longest running and most frequent
>> > ones. That is, an Action run might only run for 5 minutes each time, but if
>> > it is running 400 times a day then that occupies more than one job slot of
>> > the total of 900 ASF has, for the duration of that day.
>> > Experience so far suggests those actions that build Java are often the
>> > most time consuming.
>> >
>> > Thanks.
>> >
>> > Kind regards,
>> > -Bob Thomson.
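
Inline note on Bob's slot arithmetic above: a rough way to rank workflows is to convert run time and frequency into the average number of job slots they keep busy. The sketch below only restates his 5-minute, 400-runs-a-day example; the function name is made up for illustration and the real figures would have to come from the infra reporting tool.

```python
MINUTES_PER_DAY = 24 * 60  # one job slot supplies at most 1,440 runner-minutes per day


def slots_occupied(minutes_per_run: float, runs_per_day: int) -> float:
    """Average number of job slots a workflow keeps busy over one day."""
    return minutes_per_run * runs_per_day / MINUTES_PER_DAY


# Bob's example: a 5-minute Action that runs 400 times a day.
print(round(slots_occupied(5, 400), 2))  # 1.39
```

So a short but very frequent job can cost more than a long but rare one, which is why both run time and frequency need to be looked at together.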
>> >
>> > On 2026/06/01 18:39:38 Yufei Gu wrote:
>> > > Hi Bob,
>> > >
>> > > Thanks for the heads-up and for giving the Iceberg community time to work
>> > > on this.
>> > >
>> > > One question: Is the concern based on the overall GitHub Actions
>> > > consumption of the Iceberg projects (e.g., main repo, python repo, go repo,
>> > > etc), or only for the main Iceberg repository? Iceberg has multiple
>> > > repositories, including the main repository as well as Python, Go, Rust,
>> > > and C++ subprojects. Most of the discussion and optimization work in this
>> > > thread focuses on the main repository, where the majority of CI usage
>> > > occurs. If the overall project usage is within acceptable limits, would it
>> > > be possible to allow a higher quota for a single repo (the Iceberg main
>> > > repository), given its broader compatibility and integration testing
>> > > requirements?
>> > >
>> > > Yufei
>> > >
>> > >
>> > > On Mon, Jun 1, 2026 at 11:00 AM Steve Loughran <[email protected]> wrote:
>> > >
>> > > > This is really good for draft builds.
>> > > >
>> > > > If I'm committing and pushing work up to a WiP PR, it is often because I
>> > > > want *a* machine to do the testing; I don't care who it runs as.
>> > > >
>> > > > Forcing PRs to run as the submitter also hardens the OSS repo against
>> > > > vulnerabilities in the Github Actions and other parts of the build process.
>> > > >
>> > > > On Mon, 1 Jun 2026 at 17:11, Prashant Singh <[email protected]>
>> > > > wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> Great progress on the matrix reduction, incremental builds, and draft PR
>> > > >> skipping ideas. I'd like to propose a complementary approach that can
>> > > >> work alongside all of those: running PR CI on contributor fork compute
>> > > >> instead of the ASF shared pool.
>> > > >>
>> > > >> How it works:
>> > > >>
>> > > >> Workflows switch from pull_request to push triggers on non-main
>> > > >> branches. Each workflow:
>> > > >>
>> > > >> 1. Checks out apache/iceberg main (security boundary - untrusted code
>> > > >> can't modify the workflow itself)
>> > > >> 2. Squash-merges the contributor's fork branch on top
>> > > >> 3. Runs tests on that merged tree
>> > > >>
>> > > >> Because the push event fires on the fork, GitHub bills the CI minutes
>> > > >> to the fork owner's account - not the ASF shared pool. This takes
>> > > >> Iceberg's PR CI usage from the ASF runners to effectively zero,
>> > > >> regardless of matrix size.
>> > > >>
>> > > >> Why this is complementary:
>> > > >>
>> > > >> The optimizations discussed so far all reduce how much CI runs.
>> > > >> Fork-compute changes where it runs. They compose - a leaner matrix
>> > > >> running on fork compute is strictly better than either approach alone.
>> > > >>
>> > > >> Inline PR status:
>> > > >>
>> > > >> A lightweight notify_test_workflow.yml (using pull_request_target +
>> > > >> Checks API) is included to post fork CI results directly onto the
>> > > >> upstream PR's checks tab - so reviewers see green/red status inline as
>> > > >> they do today.
>> > > >>
>> > > >> *Prior art*:
>> > > >>
>> > > >> Apache Spark adopted this pattern in 2024 (SPARK-47041) and has been
>> > > >> running it in production since. Their full Spark CI matrix runs entirely
>> > > >> on contributor forks.
>> > > >>
>> > > >> PR: https://github.com/apache/iceberg/pull/15397: covers all 10
>> > > >> workflow files. I've verified all workflows pass on fork computation.
>> > > >>
>> > > >> This could be merged independently of the matrix/incremental
>> > > >> optimizations and would immediately eliminate PR CI pressure on the
>> > > >> ASF pool - well within the June 8 deadline.
>> > > >>
>> > > >> Thoughts?
>> > > >>
>> > > >> Prashant Singh
>> > > >>
>> > > >> On Fri, May 29, 2026 at 8:47 PM Renjie Liu <[email protected]>
>> > > >> wrote:
>> > > >>
>> > > >>> I like the idea of cutting supported JVM runs in each CI. The JVM has great
>> > > >>> backward compatibility, so we could run on one JVM (maybe JVM 17) and trigger a
>> > > >>> nightly run for JVM 21.
>> > > >>>
>> > > >>> On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]>
>> > > >>> wrote:
>> > > >>>
>> > > >>>>
>> > > >>>> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
>> > > >>>> maven repo, with spark the other half.
>> > > >>>>
>> > > >>>> It seems to me that only PRs which go near gradle/libs.versions.toml
>> > > >>>> are going to change dependencies, and so introduce new CVEs.
>> > > >>>>
>> > > >>>> There's the separate issue "CVEs are eternal" and all existing
>> > > >>>> dependencies are collections of undiscovered/unreported CVEs. That's
>> > > >>>> dependabot's homework, generally.
>> > > >>>>
>> > > >>>>
>> > > >>>> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
>> > > >>>>
>> > > >>>>> Thanks everyone for the great ideas.
>> > > >>>>>
>> > > >>>>> Here's where we stand today with respect to ASF runner usage (taken
>> > > >>>>> from the link [2] above):
>> > > >>>>> GitHub Actions Build Time Used
>> > > >>>>> - past 7 days total usage: 218,321 minutes
>> > > >>>>> - past 5 days total usage: 120,241 minutes
>> > > >>>>>
>> > > >>>>> *This puts us below the hard ceiling for resource usage* as described
>> > > >>>>> by https://infra.apache.org/github-actions-policy.html
>> > > >>>>>
>> > > >>>>> > The average number of minutes a project uses *per calendar week
>> > > >>>>> > MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
>> > > >>>>> > 4,200 hours)*.
>> > > >>>>> > The average number of minutes a project uses *in any consecutive
>> > > >>>>> > five-day period MUST NOT exceed the equivalent of 30 full-time runners
>> > > >>>>> > (216,000 minutes, or 3,600 hours)*.
>> > > >>>>>
>> > > >>>>> We should still make improvements wherever possible.
>> > > >>>>>
>> > > >>>>> I have a few PRs to reduce CI usage further.
>> > > >>>>> - CI: Limit CVE scan runs to relevant changes #16513
>> > > >>>>> - Build: Simplify CI workflow path filters to avoid per-workflow
>> > > >>>>>   maintenance #16302
>> > > >>>>>
>> > > >>>>> There are a couple of heuristics we can use:
>> > > >>>>> 1. Don't run CI if not needed. For example, `site/` dir changes
>> > > >>>>>    shouldn't trigger Spark/Flink/Java CI. This might be optimized already, but
>> > > >>>>>    we should double check just in case.
>> > > >>>>> 2. If we must run CI, fail fast. For example, if there is a formatter
>> > > >>>>>    issue, fail all inflight CI tasks.
>> > > >>>>> 3. Within a specific CI workflow, reduce the matrix wherever possible.
>> > > >>>>>    Do we really need to run all "Java versions" x "Scala versions" x "Spark
>> > > >>>>>    versions"?
>> > > >>>>> 4. Improve individual CI tasks. Spark CI accounts for 57% of all resource
>> > > >>>>>    usage. I have a tracking issue where I benchmarked where all that time is
>> > > >>>>>    spent. See https://github.com/apache/iceberg/issues/16397
>> > > >>>>>
>> > > >>>>> Top CI tasks as % of resource use:
>> > > >>>>> - Spark CI: 57.68%
>> > > >>>>> - Flink CI: 13.60%
>> > > >>>>> - Java CI: 7.02%
>> > > >>>>> - CVE Scan: 3.13%
>> > > >>>>>
>> > > >>>>> Best,
>> > > >>>>> Kevin Liu
>> > > >>>>>
>> > > >>>>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
>> > > >>>>> wrote:
>> > > >>>>>
>> > > >>>>>> Hi all,
>> > > >>>>>>
>> > > >>>>>> How about implementing the incremental PR builder?
>> > > >>>>>> (similar to
>> > > >>>>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder)
>> > > >>>>>>
>> > > >>>>>> I think one of the main causes of GitHub runner pressure in Iceberg
>> > > >>>>>> is the breadth of our CI matrix. We support multiple languages (java,
>> > > >>>>>> python, go, rust, cpp) and integrations, and for Java we test across
>> > > >>>>>> multiple JVM versions, Spark versions, Flink versions, Kafka, Hive/MR,
>> > > >>>>>> REST/OpenAPI, runtime bundles, and more. That coverage is valuable, but
>> > > >>>>>> running most of it for every PR is expensive and increases both runner
>> > > >>>>>> usage and CI wall time.
>> > > >>>>>>
>> > > >>>>>> I think the biggest win can be achieved by having an incremental PR
>> > > >>>>>> build.
>> > > >>>>>> We already have useful building blocks for it: Gradle build cache,
>> > > >>>>>> path filters, and version-selective build properties like -DsparkVersions
>> > > >>>>>> and -DflinkVersions.
>> > > >>>>>>
>> > > >>>>>> The idea is to keep full coverage on main, release branches, tags,
>> > > >>>>>> and global build changes, but make PR CI depend on the files changed:
>> > > >>>>>>
>> > > >>>>>> - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>> > > >>>>>> - spark/v4.1/** changes run only Spark 4.1, not every Spark
>> > > >>>>>>   version.
>> > > >>>>>> - flink/v2.0/** changes run only Flink 2.0, not every Flink
>> > > >>>>>>   version.
>> > > >>>>>> - API/Core/Data/File format changes run the owning Java checks
>> > > >>>>>>   plus selected downstream canaries, such as latest Spark and latest Flink,
>> > > >>>>>>   instead of the full engine matrix.
>> > > >>>>>> - Runtime/bundle CVE checks run only for affected runtime
>> > > >>>>>>   artifacts.
>> > > >>>>>> - A full-ci label or global Gradle/workflow changes can still
>> > > >>>>>>   force the full matrix.
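
Inline note: the selection rules in the list above could be prototyped roughly as below. This is only a sketch, not what the POC PR does. The job names (`spark-4.1`, `java`, `spark-latest`, `full-matrix`, ...) and the module prefixes are placeholders I made up for illustration; the real workflow names and directory layout would need to be checked against the repo. Anything the function does not recognise (including global Gradle or workflow files) falls back to the full matrix, so a mistake in the rules costs CI time rather than coverage.

```python
import re

# Hypothetical job names and module prefixes, for illustration only.
FULL_MATRIX = {"full-matrix"}
CORE_PREFIXES = ("api/", "core/", "data/", "parquet/", "orc/")
ENGINE_DIR = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")


def select_ci_jobs(changed_files, labels=()):
    """Map the files changed in a PR to the set of CI jobs to run."""
    if "full-ci" in labels:
        # Manual override for risky or cross-cutting PRs.
        return set(FULL_MATRIX)
    jobs = set()
    for path in changed_files:
        engine = ENGINE_DIR.match(path)
        if engine:
            # e.g. spark/v4.1/** runs only the Spark 4.1 job.
            jobs.add(f"{engine.group(1)}-{engine.group(2)}")
        elif path.startswith(CORE_PREFIXES):
            # Owning Java checks plus latest-engine canaries.
            jobs |= {"java", "spark-latest", "flink-latest"}
        elif path.startswith("site/"):
            # Documentation-only changes need no engine CI.
            continue
        else:
            # Global build/workflow files or unknown paths: be conservative.
            return set(FULL_MATRIX)
    return jobs


print(sorted(select_ci_jobs(["spark/v4.1/spark/src/main/java/Foo.java"])))  # ['spark-4.1']
```

The same function would run on main with the full matrix forced, so only PR builds take the incremental path.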
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>> Another possible optimization is JVM coverage. Today many PR jobs run
>> > > >>>>>> across both Java 17 and Java 21. We could consider running one primary JVM
>> > > >>>>>> for PRs, and reserving the full JVM matrix for main, release branches,
>> > > >>>>>> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
>> > > >>>>>> runner usage and PR wall time, while still preserving broad compatibility
>> > > >>>>>> coverage before changes become part of the main branch.
>> > > >>>>>>
>> > > >>>>>> A practical approach could be:
>> > > >>>>>>
>> > > >>>>>> PRs: incremental module/version selection, mostly one JVM, plus
>> > > >>>>>> targeted canaries.
>> > > >>>>>> main: full matrix across JVMs, Spark versions, Flink versions, and
>> > > >>>>>> runtime checks.
>> > > >>>>>> Manual override: full-ci label for risky or cross-cutting PRs.
>> > > >>>>>>
>> > > >>>>>> This should reduce queue time, lower GitHub runner consumption, and
>> > > >>>>>> give contributors faster feedback without giving up full coverage where it
>> > > >>>>>> matters most.
>> > > >>>>>>
>> > > >>>>>> I am working on a POC: https://github.com/apache/iceberg/pull/16566
>> > > >>>>>> Suggestions are welcome.
>> > > >>>>>>
>> > > >>>>>> - Ajantha
>> > > >>>>>>
>> > > >>>>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]>
>> > > >>>>>> wrote:
>> > > >>>>>>
>> > > >>>>>>> Hi Manu,
>> > > >>>>>>>
>> > > >>>>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>> > > >>>>>>> wrote:
>> > > >>>>>>> >
>> > > >>>>>>> > Hi Junwang,
>> > > >>>>>>> >
>> > > >>>>>>> > Not sure about others but I usually only change status to "Ready
>> > > >>>>>>> > for review" when CI has passed.
>> > > >>>>>>>
>> > > >>>>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
>> > > >>>>>>> PRs.
>> > > >>>>>>>
>> > > >>>>>>> Reasons to Disable:
>> > > >>>>>>>
>> > > >>>>>>> - Cost savings: large teams and monorepos can burn through GitHub
>> > > >>>>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>> > > >>>>>>> resources on code that may not even compile yet.
>> > > >>>>>>> - Reduced noise: draft PRs are often used for experimentation or
>> > > >>>>>>> work-in-progress changes. Disabling CI avoids cluttering the PR
>> > > >>>>>>> timeline with transient failures while the author is still iterating.
>> > > >>>>>>> - Better resource utilization: orgs with limited self-hosted runners
>> > > >>>>>>> may prefer to prioritize "Ready for Review" PRs so
>> > > >>>>>>> production-relevant
>> > > >>>>>>> changes get feedback and merge capacity sooner.
>> > > >>>>>>>
>> > > >>>>>>> Reasons to Keep:
>> > > >>>>>>>
>> > > >>>>>>> - Early error detection: developers can use draft PRs as a sandbox to
>> > > >>>>>>> validate builds and tests before requesting review.
>> > > >>>>>>> - Self-correction: failed checks on a draft PR allow authors to fix
>> > > >>>>>>> lint or test issues before involving reviewers.
>> > > >>>>>>> - Higher review confidence: by the time a PR is marked "Ready for
>> > > >>>>>>> Review", CI has often already passed at least once, leading to a
>> > > >>>>>>> smoother review process.
>> > > >>>>>>>
>> > > >>>>>>> For myself, when I create a draft PR, I'm usually sharing early
>> > > >>>>>>> work-in-progress code with other developers and may not have tested
>> > > >>>>>>> it thoroughly locally yet, so I sometimes prefer to disable CI. That's
>> > > >>>>>>> just my personal preference though.
>> > > >>>>>>>
>> > > >>>>>>> >
>> > > >>>>>>> > Regards,
>> > > >>>>>>> > Manu
>> > > >>>>>>> >
>> > > >>>>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]>
>> > > >>>>>>> > wrote:
>> > > >>>>>>> >>
>> > > >>>>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
>> > > >>>>>>> >> wrote:
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]> wrote:
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days ago.
>> > > >>>>>>> >> > > It should reduce the Spark CI cost by ~25%.
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>> > > >>>>>>> >> > > extension) cause high load due to parameter combinations. I asked AI to
>> > > >>>>>>> >> > > analyze the build log and recommend changes offering the best ROI. Details
>> > > >>>>>>> >> > > are in this doc.
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > I can look into dropping some combinations without
>> > > >>>>>>> >> > > sacrificing essential coverage. E.g., we can probably drop the Hadoop
>> > > >>>>>>> >> > > catalog usage in test, as it wasn't recommended for production use anyway.
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>> > > >>>>>>> >> > usage a little bit. Perhaps we should apply the same approach across
>> > > >>>>>>> >> > all iceberg subprojects?
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>> > > >>>>>>> >>
>> > > >>>>>>> >> I've created a PR to show that, see [1]. Since it's a draft, the CI
>> > > >>>>>>> >> won't run. If I click the `Ready for review` button, the actions will
>> > > >>>>>>> >> be triggered. Let me know what you think about it.
>> > > >>>>>>> >>
>> > > >>>>>>> >> [1] https://github.com/apache/iceberg/pull/16561
>> > > >>>>>>> >>
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <[email protected]> wrote:
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> Apache DataFusion similarly received this notice. For
>> > > >>>>>>> >> > >> visibility to the Iceberg community, we have tracking issues to try to
>> > > >>>>>>> >> > >> discuss solutions:
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>> > > >>>>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> DataFusion Comet is consuming the vast majority of
>> > > >>>>>>> >> > >> DataFusion resources, and like the Iceberg project it's due to Spark tests
>> > > >>>>>>> >> > >> (and Iceberg's Spark tests). We are doing some analysis on what subsets
>> > > >>>>>>> >> > >> might be appropriate for our workflows, features, and goals, and will share
>> > > >>>>>>> >> > >> anything that we think might translate back to the Iceberg CI workflows.
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <[email protected]> wrote:
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Hello, Iceberg PMC.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>> > > >>>>>>> >> > >>> across the foundation[1]. The ASF Github shared pool of
>> > > >>>>>>> >> > >>> Github-hosted runners has been at, or very close to the limit of
>> > > >>>>>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>> > > >>>>>>> >> > >>> case again today.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Your project has been identified as being among the top 5 consumers of
>> > > >>>>>>> >> > >>> build time over the past 7 days and we request that you bring your
>> > > >>>>>>> >> > >>> usage down by stream-lining long-running builds. Contact Infra for
>> > > >>>>>>> >> > >>> a consultation if you are unable to streamline your builds further.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage as you
>> > > >>>>>>> >> > >>> work on stream-lining, as well as locate any bottlenecks in the workflows.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Infra will allow you two weeks' time (till the 8th of June, 2026) to
>> > > >>>>>>> >> > >>> progress this, but should you still be above the limits by then,
>> > > >>>>>>> >> > >>> without a viable path forward, we will be limiting your GHA usage.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Kind regards,
>> > > >>>>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>> > > >>>>>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> >
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > --
>> > > >>>>>>> >> > Regards
>> > > >>>>>>> >> > Junwang Zhao
>> > > >>>>>>> >>
>> > > >>>>>>> >>
>> > > >>>>>>> >>
>> > > >>>>>>> >> --
>> > > >>>>>>> >> Regards
>> > > >>>>>>> >> Junwang Zhao
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>> --
>> > > >>>>>>> Regards
>> > > >>>>>>> Junwang Zhao
>> > > >>>>>>>
>> > > >>>>>>
>> > > >>
>> >
>>
>
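
P.S. For anyone tracking the numbers while these PRs land, here is a small sketch that compares the usage figures from Kevin's message against the two policy ceilings he quoted. The limits and the usage values are copied from the thread; the function name is made up for illustration, and fresh usage numbers would have to be read off the infra reporting tool by hand.

```python
# Ceilings as quoted from the ASF GitHub Actions policy in this thread.
WEEKLY_LIMIT_MINUTES = 250_000    # "25 full-time runners" per calendar week
FIVE_DAY_LIMIT_MINUTES = 216_000  # "30 full-time runners" over any five days


def headroom(used_minutes: int, limit_minutes: int):
    """Return (minutes left under the limit, fraction of the limit used)."""
    return limit_minutes - used_minutes, used_minutes / limit_minutes


# Usage figures reported in Kevin's message.
for label, used, limit in [
    ("7-day", 218_321, WEEKLY_LIMIT_MINUTES),
    ("5-day", 120_241, FIVE_DAY_LIMIT_MINUTES),
]:
    left, fraction = headroom(used, limit)
    print(f"{label}: {left:,} minutes left, {fraction:.0%} of limit used")
```

The 216,000 figure is exactly 30 runners x 5 days x 1,440 minutes. The weekly figure of 250,000 is the policy's rounded number (25 x 7 x 1,440 would be 252,000), so the sketch uses the published value rather than recomputing it.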
