Hi all,

Great progress on the matrix reduction, incremental builds, and draft PR skipping ideas. I'd like to propose a complementary approach that can work alongside all of those: running PR CI on contributor fork compute instead of the ASF shared pool.

How it works:
Workflows switch from pull_request to push triggers on non-main branches. Each workflow:
1. Checks out apache/iceberg main (a security boundary: untrusted code can't modify the workflow itself)
2. Squash-merges the contributor's fork branch on top
3. Runs tests on that merged tree

Because the push event fires on the fork, GitHub bills the CI minutes to the fork owner's account, not the ASF shared pool. This takes Iceberg's PR CI usage of the ASF runners to effectively zero, regardless of matrix size.

Why this is complementary:
The optimizations discussed so far all reduce how much CI runs. Fork compute changes where it runs. They compose: a leaner matrix running on fork compute is strictly better than either approach alone.

Inline PR status:
A lightweight notify_test_workflow.yml (using pull_request_target + Checks API) is included to post fork CI results directly onto the upstream PR's checks tab, so reviewers see green/red status inline as they do today.

*Prior art*:
Apache Spark adopted this pattern in 2024 (SPARK-47041) and has been running it in production since. Their full Spark CI matrix runs entirely on contributor forks.

PR: https://github.com/apache/iceberg/pull/15397
It covers all 10 workflow files. I've verified that all workflows pass on fork compute.

This could be merged independently of the matrix/incremental optimizations and would immediately eliminate PR CI pressure on the ASF pool, well within the June 8 deadline.

Thoughts?

Prashant Singh

On Fri, May 29, 2026 at 8:47 PM Renjie Liu <[email protected]> wrote:

> I like the idea of cutting supported jvm runs in each ci. JVM has great
> backward compatibility, and we run on one jvm (maybe jvm 17) and trigger a
> nightly run for jvm 21.
>
> On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]>
> wrote:
>
>>
>> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
>> maven repo, with spark the other half.
>>
>> It seems to me that only PRs which go near gradle/libs.versions.toml are
>> going to change dependences, so introduce new CVEs.
>>
>> There's the separate issue "CVEs are eternal" and all existing
>> dependencies are collections of undiscovered/unreported cves. That's
>> dependabot's homework, generally.
>>
>>
>> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
>>
>>> Thanks everyone for the great ideas.
>>>
>>> Here's where we stand today with respect to ASF runner usage (taken from
>>> the link [2] above):
>>> GitHub Actions Build Time Used
>>> - past 7 days total usage: 218,321 minutes
>>> - past 5 days total usage: 120,241 minutes
>>>
>>> *This puts us below the hard ceiling for resource usage* as described
>>> by https://infra.apache.org/github-actions-policy.html
>>>
>>> > The average number of minutes a project uses *per calendar week MUST
>>> > NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
>>> > 4,200 hours)*.
>>> > The average number of minutes a project uses *in any consecutive
>>> > five-day period MUST NOT exceed the equivalent of 30 full-time runners
>>> > (216,000 minutes, or 3,600 hours)*.
>>>
>>> We should still make improvements wherever possible.
>>>
>>> I have a few PRs to reduce CI usage further.
>>> - CI: Limit CVE scan runs to relevant changes #16513
>>> - Build: Simplify CI workflow path filters to avoid per-workflow
>>> maintenance #16302
>>>
>>> There are a couple of heuristics we can use
>>> 1. Don't run CI if not needed. For example, `site/` dir changes
>>> shouldn't trigger Spark/Flink/Java CI. This might be optimized already, but
>>> we should double check just in case.
>>> 2. If we must run CI, fail fast. For example, if there is a formatter
>>> issue, fail all inflight CI tasks.
>>> 3. Within a specific CI workflow, reduce the matrix wherever possible.
>>> Do we really need to run all "Java versions" x "Scala versions" x "Spark
>>> versions"?
>>> 4. Improve individual CI tasks.
>>> Spark CI dominates 57% of all resource
>>> usage. I have a tracking issue where I benchmarked where all that time is
>>> spent. See https://github.com/apache/iceberg/issues/16397
>>>
>>> Top CI tasks as % of resource use:
>>> - Spark CI: 57.68%
>>> - Flink CI: 13.60%
>>> - Java CI: 7.02%
>>> - CVE Scan: 3.13%
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> How about implementing the incremental PR builder? (similar to
>>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
>>>> )
>>>>
>>>> I think one of the main causes of GitHub runner pressure in Iceberg is
>>>> the breadth of our CI matrix. We support multiple languages (java, python,
>>>> go, rust, cpp) and integrations, and for Java we test across multiple JVM
>>>> versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
>>>> runtime bundles, and more. That coverage is valuable, but running most of
>>>> it for every PR is expensive and increases both runner usage and CI wall
>>>> time.
>>>>
>>>> I think the biggest win can be achieved by having an incremental PR
>>>> build.
>>>> We already have useful building blocks for it: Gradle build cache, path
>>>> filters, and version-selective build properties like -DsparkVersions and
>>>> -DflinkVersions.
>>>>
>>>> The idea is to keep full coverage on main, release branches, tags, and
>>>> global build changes, but make PR CI depend on the files changed:
>>>>
>>>> - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>>>> - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
>>>> - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
>>>> - API/Core/Data/File format changes run the owning Java checks plus
>>>> selected downstream canaries, such as latest Spark and latest Flink,
>>>> instead of the full engine matrix.
>>>> - Runtime/bundle CVE checks run only for affected runtime artifacts.
>>>> - A full-ci label or global Gradle/workflow changes can still force
>>>> the full matrix.
>>>>
>>>>
>>>> Another possible optimization is JVM coverage. Today many PR jobs run
>>>> across both Java 17 and Java 21. We could consider running one primary JVM
>>>> for PRs, and reserve the full JVM matrix for main, release branches,
>>>> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
>>>> runner usage and PR wall time, while still preserving broad compatibility
>>>> coverage before changes become part of the main branch.
>>>>
>>>> A practical approach could be:
>>>>
>>>> PRs: incremental module/version selection, mostly one JVM, plus
>>>> targeted canaries.
>>>> main: full matrix across JVMs, Spark versions, Flink versions, and
>>>> runtime checks.
>>>> Manual override: full-ci label for risky or cross-cutting PRs.
>>>>
>>>> This should reduce queue time, lower GitHub runner consumption, and
>>>> give contributors faster feedback without giving up full coverage where it
>>>> matters most.
>>>>
>>>> I am working on a POC https://github.com/apache/iceberg/pull/16566
>>>> Suggestions are welcome.
>>>>
>>>> - Ajantha
>>>>
>>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:
>>>>
>>>>> Hi Manu,
>>>>>
>>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > Hi Junwang,
>>>>> >
>>>>> > Not sure about others but I usually only change status to "Ready for
>>>>> > review" when CI has passed.
>>>>>
>>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
>>>>> PRs.
>>>>>
>>>>> Reasons to Disable:
>>>>>
>>>>> - Cost savings: large teams and monorepos can burn through GitHub
>>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>>>>> resources on code that may not even compile yet.
>>>>> - Reduced noise: draft PRs are often used for experimentation or
>>>>> work-in-progress changes.
>>>>> Disabling CI avoids cluttering the PR
>>>>> timeline with transient failures while the author is still iterating.
>>>>> - Better resource utilization: orgs with limited self-hosted runners
>>>>> may prefer to prioritize "Ready for Review" PRs so production-relevant
>>>>> changes get feedback and merge capacity sooner.
>>>>>
>>>>> Reasons to Keep:
>>>>>
>>>>> - Early error detection: developers can use draft PRs as a sandbox to
>>>>> validate builds and tests before requesting review.
>>>>> - Self-correction: failed checks on a draft PR allow authors to fix
>>>>> lint or test issues before involving reviewers.
>>>>> - Higher review confidence: by the time a PR is marked "Ready for
>>>>> Review", CI has often already passed at least once, leading to a
>>>>> smoother review process.
>>>>>
>>>>> For myself, when I create a draft PR, I'm usually sharing early
>>>>> work-in-progress code with other developers and may not have tested it
>>>>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
>>>>> just my personal preference though.
>>>>>
>>>>> >
>>>>> > Regards,
>>>>> > Manu
>>>>> >
>>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]> wrote:
>>>>> >>
>>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]> wrote:
>>>>> >> >
>>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]> wrote:
>>>>> >> > >
>>>>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days ago. It
>>>>> >> > > should reduce the Spark CI cost by ~25%.
>>>>> >> > >
>>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>>>>> >> > > extension) cause high load due to parameter combinations. I asked AI to
>>>>> >> > > analyze the build log and recommend changes offering the best ROI. Details
>>>>> >> > > are in this doc.
>>>>> >> > >
>>>>> >> > > I can look into dropping some combinations without sacrificing
>>>>> >> > > essential coverage.
>>>>> >> > > E.g., we can probably drop the Hadoop catalog usage in
>>>>> >> > > test, as it wasn't recommended for production use anyway.
>>>>> >> >
>>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>>>>> >> > usage a little bit. Perhaps we should apply the same approach across
>>>>> >> > all iceberg subprojects?
>>>>> >> >
>>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>>>>> >>
>>>>> >> I've created a PR to show that, see [1], since it's a draft, the CI
>>>>> >> won't run. If I click the `Ready for review` button, the actions will
>>>>> >> be triggered. Let me know what you think about it.
>>>>> >>
>>>>> >> [1] https://github.com/apache/iceberg/pull/16561
>>>>> >>
>>>>> >> >
>>>>> >> > >
>>>>> >> > >
>>>>> >> > >
>>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <[email protected]> wrote:
>>>>> >> > >>
>>>>> >> > >> Apache DataFusion similarly received this notice. For
>>>>> >> > >> visibility to the Iceberg community, we have tracking issues to try to
>>>>> >> > >> discuss solutions:
>>>>> >> > >>
>>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>>>>> >> > >>
>>>>> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
>>>>> >> > >> resources, and like the Iceberg project it's due to Spark tests (and
>>>>> >> > >> Iceberg's Spark tests). We are doing some analysis on what subsets might be
>>>>> >> > >> appropriate for our workflows, features, and goals, and will share anything
>>>>> >> > >> that we think might translate back to the Iceberg CI workflows.
>>>>> >> > >>
>>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <[email protected]> wrote:
>>>>> >> > >>>
>>>>> >> > >>> Hello, Iceberg PMC.
>>>>> >> > >>>
>>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>>>>> >> > >>> across the foundation[1].
>>>>> >> > >>> The ASF Github shared pool of
>>>>> >> > >>> Github-hosted runners has been at, or very close to the limit of
>>>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>>>>> >> > >>> case again today.
>>>>> >> > >>>
>>>>> >> > >>> Your project has been identified as being among the top 5 consumers of
>>>>> >> > >>> build time over the past 7 days and we request that you bring your
>>>>> >> > >>> usage down by stream-lining long-running builds. Contact Infra for
>>>>> >> > >>> a consultation if you are unable to streamline your builds further.
>>>>> >> > >>>
>>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage as you
>>>>> >> > >>> work on stream-lining, as well as locate any bottlenecks in the workflows.
>>>>> >> > >>>
>>>>> >> > >>> Infra will allow you two weeks time (till the 8th of June, 2026) to
>>>>> >> > >>> progress this, but should you still be above the limits by then,
>>>>> >> > >>> without a viable path forward, we will be limiting your GHA usage.
>>>>> >> > >>>
>>>>> >> > >>> Kind regards,
>>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>>>>> >> > >>>
>>>>> >> > >>>
>>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>>>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>>>>> >> > >>>
>>>>> >> >
>>>>> >> >
>>>>> >> > --
>>>>> >> > Regards
>>>>> >> > Junwang Zhao
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Regards
>>>>> >> Junwang Zhao
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards
>>>>> Junwang Zhao
>>>>>
>>>>
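
P.S. For anyone who wants to see the three steps under "How it works" (at the top of this message) spelled out, here is a rough Python sketch of the command sequence. It is an illustration only, not the contents of the PR: the helper name fork_ci_commands, the assumption that the fork is named "iceberg", the throwaway commit identity, and the Gradle "check" task used as the test command are all placeholders I picked for the example.

```python
from typing import List

# Upstream repository named in this thread.
UPSTREAM_URL = "https://github.com/apache/iceberg.git"


def fork_ci_commands(fork_owner: str, branch: str, workdir: str = "merged") -> List[List[str]]:
    """Return the commands for the three steps, in order (hypothetical helper)."""
    # Assumption: the contributor's fork keeps the repository name "iceberg".
    fork_url = f"https://github.com/{fork_owner}/iceberg.git"
    git = ["git", "-C", workdir]
    return [
        # 1. Start from upstream main so the base tree is the trusted one.
        ["git", "clone", "--branch", "main", UPSTREAM_URL, workdir],
        # 2. Fetch the contributor's branch and squash-merge it on top.
        git + ["fetch", fork_url, branch],
        git + ["merge", "--squash", "FETCH_HEAD"],
        # A squash merge only stages changes, so commit them explicitly
        # (placeholder identity, an assumption for this sketch).
        git + ["-c", "user.name=ci", "-c", "user.email=ci@example.invalid",
               "commit", "-m", f"Squash-merge {fork_owner}:{branch}"],
        # 3. Run the tests on the merged tree (placeholder test command).
        [f"{workdir}/gradlew", "-p", workdir, "check"],
    ]
```

Each entry is an argument list that could be passed to subprocess.run(cmd, check=True) in order; the first failing step would then stop the run.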
