Re: Iceberg Consumption of ASF Shared GitHub-hosted Runners

Yufei Gu Mon, 01 Jun 2026 11:41:27 -0700

Hi Bob,

Thanks for the heads-up and for giving the Iceberg community time to work
on this.


One question: is the concern based on the overall GitHub Actions
consumption across all Iceberg repositories (the main repository plus the
Python, Go, Rust, and C++ subprojects), or only on the main Iceberg
repository? Most of the discussion and optimization work in this thread
focuses on the main repository, where the majority of CI usage occurs. If
the overall project usage is within acceptable limits, would it be possible
to allow a higher quota for a single repo (the Iceberg main repository),
given its broader compatibility and integration testing requirements?

Yufei


On Mon, Jun 1, 2026 at 11:00 AM Steve Loughran <[email protected]> wrote:

> This is really good for draft builds.
>
> If I'm committing and pushing work up to a WiP PR, it is often because I
> want *a* machine to do the testing; I don't care who it runs as.
>
> Forcing PRs to run as the submitter also hardens the OSS repo against
> vulnerabilities in GitHub Actions and other parts of the build process.
>
> On Mon, 1 Jun 2026 at 17:11, Prashant Singh <[email protected]>
> wrote:
>
>>   Hi all,
>>
>>   Great progress on the matrix reduction, incremental builds, and draft PR
>>   skipping ideas. I'd like to propose a complementary approach that can
>>   work alongside all of those: running PR CI on contributor fork compute
>>   instead of the ASF shared pool.
>>
>>   How it works:
>>
>>   Workflows switch from pull_request to push triggers on non-main
>>   branches. Each workflow:
>>
>>   1. Checks out apache/iceberg main (security boundary — untrusted code
>>   can't modify the workflow itself)
>>   2. Squash-merges the contributor's fork branch on top
>>   3. Runs tests on that merged tree
>>
>>   Because the push event fires on the fork, GitHub bills the CI minutes
>>   to the fork owner's account - not the ASF shared pool. This takes
>>   Iceberg's PR CI usage from the ASF runners to effectively zero,
>>   regardless of matrix size.
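
As a rough illustration of the three steps above, here is a minimal Python sketch that only builds the command list such a job might run. It is not taken from the PR; the function name `fork_ci_plan`, the `merged` work directory, the placeholder commit identity, and the caller-supplied test command are all assumptions made for illustration.

```python
def fork_ci_plan(upstream_url, fork_url, fork_branch, test_cmd, workdir="merged"):
    """Return the commands for the three steps as a list of argv lists.

    Hypothetical helper: it only describes the commands, it does not run them.
    """
    git = ["git", "-C", workdir]
    # A commit identity is needed for the squash commit; this one is a placeholder.
    ident = ["-c", "user.name=ci", "-c", "user.email=ci@example.invalid"]
    return [
        # 1. Check out upstream main as the base of the tree under test.
        ["git", "clone", "--branch", "main", upstream_url, workdir],
        # 2. Fetch the contributor's branch and squash-merge it on top.
        git + ["fetch", fork_url, fork_branch],
        git + ["merge", "--squash", "FETCH_HEAD"],
        git + ident + ["commit", "-m", f"Squash-merge {fork_branch} for CI"],
        # 3. Run the tests on the merged tree (to be executed with cwd=workdir).
        list(test_cmd),
    ]
```

A caller could pass the upstream URL, a fork URL, a branch name, and whatever test command the workflow uses, then execute each entry in order with `subprocess.run(cmd, check=True)`.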
>>
>>   Why this is complementary:
>>
>>   The optimizations discussed so far all reduce how much CI runs.
>>   Fork-compute changes where it runs. They compose - a leaner matrix
>>   running on fork compute is strictly better than either approach alone.
>>
>>   Inline PR status:
>>
>>   A lightweight notify_test_workflow.yml (using pull_request_target +
>>   Checks API) is included to post fork CI results directly onto the
>>   upstream PR's checks tab - so reviewers see green/red status inline as
>>   they do today.
>>
>>   *Prior art*:
>>
>>   Apache Spark adopted this pattern in 2024 (SPARK-47041) and has been
>>   running it in production since. Their full Spark CI matrix runs entirely
>>   on contributor forks.
>>
>>   PR: https://github.com/apache/iceberg/pull/15397 (covers all 10
>>   workflow files). I've verified all workflows pass on fork compute.
>>
>>   This could be merged independently of the matrix/incremental
>>   optimizations and would immediately eliminate PR CI pressure on the
>>   ASF pool - well within the June 8 deadline.
>>
>>   Thoughts?
>>
>> Prashant Singh
>>
>> On Fri, May 29, 2026 at 8:47 PM Renjie Liu <[email protected]>
>> wrote:
>>
>>> I like the idea of cutting the number of JVM versions run in each CI job.
>>> The JVM has great backward compatibility, so we could run on one JVM
>>> (maybe Java 17) and trigger a nightly run for Java 21.
>>>
>>> On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
>>>> maven repo, with spark the other half.
>>>>
>>>> It seems to me that only PRs which go near gradle/libs.versions.toml
>>>> are going to change dependencies, and so introduce new CVEs.
>>>>
>>>> There's the separate issue that "CVEs are eternal" and all existing
>>>> dependencies are collections of undiscovered/unreported CVEs. That's
>>>> dependabot's homework, generally.
>>>>
>>>>
>>>> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
>>>>
>>>>> Thanks everyone for the great ideas.
>>>>>
>>>>> Here's where we stand today with respect to ASF runner usage (taken
>>>>> from the link [2] above):
>>>>> GitHub Actions Build Time Used
>>>>> - past 7 days total usage: 218,321 minutes
>>>>> - past 5 days total usage: 120,241 minutes
>>>>>
>>>>> *This puts us below the hard ceiling for resource usage* as described
>>>>> by https://infra.apache.org/github-actions-policy.html
>>>>>
>>>>> > The average number of minutes a project uses *per calendar week
>>>>> MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes,
>>>>> or 4,200 hours)*.
>>>>> > The average number of minutes a project uses *in any consecutive
>>>>> five-day period MUST NOT exceed the equivalent of 30 full-time runners
>>>>> (216,000 minutes, or 3,600 hours)*.
>>>>>
>>>>> We should still make improvements wherever possible.
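
To make the headroom concrete, the following Python sketch compares the usage figures quoted above with the two policy ceilings. The helper names are hypothetical; the numbers come from the message above. Note that 25 runners over 7 days is exactly 252,000 minutes, which the policy text rounds to 250,000.

```python
MINUTES_PER_DAY = 24 * 60

def runner_minutes(runners, days):
    """Minutes consumed by `runners` machines running non-stop for `days` days."""
    return runners * days * MINUTES_PER_DAY

def headroom(used, limit):
    """Fraction of the limit that is still unused (negative if over the limit)."""
    return 1 - used / limit

# Ceilings as stated in the policy text quoted above.
WEEKLY_LIMIT = 250_000
FIVE_DAY_LIMIT = 216_000

# Usage figures quoted above.
weekly_used = 218_321
five_day_used = 120_241

weekly_headroom = headroom(weekly_used, WEEKLY_LIMIT)        # about 0.127
five_day_headroom = headroom(five_day_used, FIVE_DAY_LIMIT)  # about 0.443
```

On these figures the weekly ceiling is the tighter one, with roughly 13% of the budget left, against roughly 44% for the five-day window.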
>>>>>
>>>>> I have a few PRs to reduce CI usage further.
>>>>> - CI: Limit CVE scan runs to relevant changes #16513
>>>>> - Build: Simplify CI workflow path filters to avoid per-workflow
>>>>> maintenance #16302
>>>>>
>>>>> There are a few heuristics we can use:
>>>>> 1. Don't run CI if not needed. For example, `site/` dir changes
>>>>> shouldn't trigger Spark/Flink/Java CI. This might be optimized already,
>>>>> but we should double check just in case.
>>>>> 2. If we must run CI, fail fast. For example, if there is a formatter
>>>>> issue, cancel all in-flight CI tasks.
>>>>> 3. Within a specific CI workflow, reduce the matrix wherever possible.
>>>>> Do we really need to run all "Java versions" x "Scala versions" x "Spark
>>>>> versions"?
>>>>> 4. Improve individual CI tasks. Spark CI accounts for more than 57% of
>>>>> all resource usage. I have a tracking issue where I benchmarked where
>>>>> all that time is spent. See https://github.com/apache/iceberg/issues/16397
>>>>>
>>>>> Top CI tasks as % of resource use:
>>>>> - Spark CI: 57.68%
>>>>> - Flink CI: 13.60%
>>>>> - Java CI: 7.02%
>>>>> - CVE Scan: 3.13%
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> How about implementing an incremental PR builder? (similar to
>>>>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder)
>>>>>>
>>>>>> I think one of the main causes of GitHub runner pressure in Iceberg
>>>>>> is the breadth of our CI matrix. We support multiple languages (Java,
>>>>>> Python, Go, Rust, C++) and integrations, and for Java we test across
>>>>>> multiple JVM versions, Spark versions, Flink versions, Kafka, Hive/MR,
>>>>>> REST/OpenAPI, runtime bundles, and more. That coverage is valuable, but
>>>>>> running most of it for every PR is expensive and increases both runner
>>>>>> usage and CI wall time.
>>>>>>
>>>>>> I think the biggest win can be achieved by having an incremental PR
>>>>>> build.
>>>>>> We already have useful building blocks for it: Gradle build cache,
>>>>>> path filters, and version-selective build properties like -DsparkVersions
>>>>>> and -DflinkVersions.
>>>>>>
>>>>>> The idea is to keep full coverage on main, release branches, tags,
>>>>>> and global build changes, but make PR CI depend on the files changed:
>>>>>>
>>>>>>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>>>>>>    - spark/v4.1/** changes run only Spark 4.1, not every Spark
>>>>>>    version.
>>>>>>    - flink/v2.0/** changes run only Flink 2.0, not every Flink
>>>>>>    version.
>>>>>>    - API/Core/Data/File format changes run the owning Java checks
>>>>>>    plus selected downstream canaries, such as latest Spark and latest
>>>>>>    Flink, instead of the full engine matrix.
>>>>>>    - Runtime/bundle CVE checks run only for affected runtime
>>>>>>    artifacts.
>>>>>>    - A full-ci label or global Gradle/workflow changes can still
>>>>>>    force the full matrix.
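
The selection rules in the list above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the POC's implementation: the job names, the `api/`, `core/`, and `data/` directory prefixes, the set of "global" paths, and the choice to fall back to the full matrix for unrecognized paths are all hypothetical.

```python
import re

# Matches paths such as spark/v4.1/... or flink/v2.0/...
ENGINE_DIR = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")

# Assumed path prefixes; the real repository layout may differ.
GLOBAL_PREFIXES = ("build.gradle", "settings.gradle", "gradle/", ".github/workflows/")
CORE_PREFIXES = ("api/", "core/", "data/")

FULL = {"full-matrix"}

def select_jobs(changed_files, labels=()):
    """Pick a set of (hypothetical) CI job names for a PR from its changed paths."""
    if "full-ci" in labels:
        return set(FULL)
    jobs = set()
    for path in changed_files:
        if path.startswith(GLOBAL_PREFIXES):
            return set(FULL)  # global build or workflow change
        match = ENGINE_DIR.match(path)
        if match:
            # Engine-version-specific change: run only that engine version.
            jobs.add(f"{match.group(1)}-{match.group(2)}")
        elif path.startswith(CORE_PREFIXES):
            # Shared module: owning Java checks plus downstream canaries.
            jobs.update({"java", "spark-latest", "flink-latest"})
        else:
            return set(FULL)  # unknown path: stay conservative
    return jobs
```

For example, `select_jobs(["spark/v4.1/Foo.java"])` selects only the `spark-4.1` job, while any change under `gradle/` or a `full-ci` label selects the full matrix.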
>>>>>>
>>>>>>
>>>>>> Another possible optimization is JVM coverage. Today many PR jobs run
>>>>>> across both Java 17 and Java 21. We could consider running one primary
>>>>>> JVM for PRs, and reserve the full JVM matrix for main, release branches,
>>>>>> nightly/scheduled builds, or PRs labeled full-ci. That would further
>>>>>> reduce runner usage and PR wall time, while still preserving broad
>>>>>> compatibility coverage before changes become part of the main branch.
>>>>>>
>>>>>> A practical approach could be:
>>>>>>
>>>>>> PRs: incremental module/version selection, mostly one JVM, plus
>>>>>> targeted canaries.
>>>>>> main: full matrix across JVMs, Spark versions, Flink versions, and
>>>>>> runtime checks.
>>>>>> Manual override: full-ci label for risky or cross-cutting PRs.
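
A small Python sketch of the JVM part of this split, to show how the matrix size changes. The function name, the choice of Java 17 as the primary JVM, and the engine versions in the usage note are assumptions for illustration only.

```python
from itertools import product

ALL_JVMS = (17, 21)  # the two Java versions mentioned above
PRIMARY_JVM = 17     # assumed choice of primary JVM for PRs

def build_matrix(engine_versions, is_pull_request, labels=()):
    """Return (jvm, engine_version) pairs to run for this build."""
    if is_pull_request and "full-ci" not in labels:
        jvms = (PRIMARY_JVM,)
    else:
        jvms = ALL_JVMS  # main, release, scheduled, or full-ci builds
    return list(product(jvms, engine_versions))
```

With three placeholder engine versions, a plain PR would run 3 combinations instead of 6, and adding the `full-ci` label restores all 6.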
>>>>>>
>>>>>> This should reduce queue time, lower GitHub runner consumption, and
>>>>>> give contributors faster feedback without giving up full coverage where
>>>>>> it matters most.
>>>>>>
>>>>>> I am working on a POC: https://github.com/apache/iceberg/pull/16566
>>>>>> Suggestions are welcome.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Manu,
>>>>>>>
>>>>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi Junwang,
>>>>>>> >
>>>>>>> > Not sure about others but I usually only change status to "Ready
>>>>>>> > for review" when CI has passed.
>>>>>>>
>>>>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
>>>>>>> PRs.
>>>>>>>
>>>>>>> Reasons to Disable:
>>>>>>>
>>>>>>> - Cost savings: large teams and monorepos can burn through GitHub
>>>>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>>>>>>> resources on code that may not even compile yet.
>>>>>>> - Reduced noise: draft PRs are often used for experimentation or
>>>>>>> work-in-progress changes. Disabling CI avoids cluttering the PR
>>>>>>> timeline with transient failures while the author is still iterating.
>>>>>>> - Better resource utilization: orgs with limited self-hosted runners
>>>>>>> may prefer to prioritize "Ready for Review" PRs so
>>>>>>> production-relevant changes get feedback and merge capacity sooner.
>>>>>>>
>>>>>>> Reasons to Keep:
>>>>>>>
>>>>>>> - Early error detection: developers can use draft PRs as a sandbox to
>>>>>>> validate builds and tests before requesting review.
>>>>>>> - Self-correction: failed checks on a draft PR allow authors to fix
>>>>>>> lint or test issues before involving reviewers.
>>>>>>> - Higher review confidence: by the time a PR is marked "Ready for
>>>>>>> Review", CI has often already passed at least once, leading to a
>>>>>>> smoother review process.
>>>>>>>
>>>>>>> For myself, when I create a draft PR, I'm usually sharing early
>>>>>>> work-in-progress code with other developers and may not have tested
>>>>>>> it thoroughly locally yet, so I sometimes prefer to disable CI. That's
>>>>>>> just my personal preference though.
>>>>>>>
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Manu
>>>>>>> >
>>>>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
>>>>>>> wrote:
>>>>>>> >> >
>>>>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <
>>>>>>> [email protected]> wrote:
>>>>>>> >> > >
>>>>>>> >> > > Kevin's PR removing Spark 3.4 was merged a few days ago.
>>>>>>> >> > > It should reduce the Spark CI cost by ~25%.
>>>>>>> >> > >
>>>>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>>>>>>> >> > > extension) cause high load due to parameter combinations. I
>>>>>>> >> > > asked AI to analyze the build log and recommend changes
>>>>>>> >> > > offering the best ROI. Details are in this doc.
>>>>>>> >> > >
>>>>>>> >> > > I can look into dropping some combinations without
>>>>>>> >> > > sacrificing essential coverage. E.g., we can probably drop the
>>>>>>> >> > > Hadoop catalog usage in tests, as it wasn't recommended for
>>>>>>> >> > > production use anyway.
>>>>>>> >> >
>>>>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI
>>>>>>> >> > resource usage a little bit. Perhaps we should apply the same
>>>>>>> >> > approach across all Iceberg subprojects?
>>>>>>> >> >
>>>>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>>>>>>> >>
>>>>>>> >> I've created a PR to demonstrate this; see [1]. Since it's a
>>>>>>> >> draft, CI won't run. If I click the `Ready for review` button,
>>>>>>> >> the actions will be triggered. Let me know what you think.
>>>>>>> >>
>>>>>>> >> [1] https://github.com/apache/iceberg/pull/16561
>>>>>>> >>
>>>>>>> >> >
>>>>>>> >> > >
>>>>>>> >> > >
>>>>>>> >> > >
>>>>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <
>>>>>>> [email protected]> wrote:
>>>>>>> >> > >>
>>>>>>> >> > >> Apache DataFusion similarly received this notice. For
>>>>>>> >> > >> visibility to the Iceberg community, we have tracking issues
>>>>>>> >> > >> to try to discuss solutions:
>>>>>>> >> > >>
>>>>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>>>>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>>>>>>> >> > >>
>>>>>>> >> > >> DataFusion Comet is consuming the vast majority of
>>>>>>> >> > >> DataFusion resources, and like the Iceberg project it's due
>>>>>>> >> > >> to Spark tests (and Iceberg's Spark tests). We are doing some
>>>>>>> >> > >> analysis on what subsets might be appropriate for our
>>>>>>> >> > >> workflows, features, and goals, and will share anything that
>>>>>>> >> > >> we think might translate back to the Iceberg CI workflows.
>>>>>>> >> > >>
>>>>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <
>>>>>>> [email protected]> wrote:
>>>>>>> >> > >>>
>>>>>>> >> > >>> Hello, Iceberg PMC.
>>>>>>> >> > >>>
>>>>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions
>>>>>>> >> > >>> usage across the foundation[1]. The ASF GitHub shared pool of
>>>>>>> >> > >>> GitHub-hosted runners has been at, or very close to, the
>>>>>>> >> > >>> limit of 900 jobs most of the time in the past few weeks, and
>>>>>>> >> > >>> this is the case again today.
>>>>>>> >> > >>>
>>>>>>> >> > >>> Your project has been identified as being among the top 5
>>>>>>> >> > >>> consumers of build time over the past 7 days, and we request
>>>>>>> >> > >>> that you bring your usage down by streamlining long-running
>>>>>>> >> > >>> builds. Contact Infra for a consultation if you are unable to
>>>>>>> >> > >>> streamline your builds further.
>>>>>>> >> > >>>
>>>>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA
>>>>>>> >> > >>> usage as you work on streamlining, as well as locate any
>>>>>>> >> > >>> bottlenecks in the workflows.
>>>>>>> >> > >>>
>>>>>>> >> > >>> Infra will allow you two weeks' time (till the 8th of June,
>>>>>>> >> > >>> 2026) to progress this, but should you still be above the
>>>>>>> >> > >>> limits by then, without a viable path forward, we will be
>>>>>>> >> > >>> limiting your GHA usage.
>>>>>>> >> > >>>
>>>>>>> >> > >>> Kind regards,
>>>>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>>>>>>> >> > >>>
>>>>>>> >> > >>>
>>>>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>>>>>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>>>>>>> >> > >>>
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > --
>>>>>>> >> > Regards
>>>>>>> >> > Junwang Zhao
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Regards
>>>>>>> >> Junwang Zhao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards
>>>>>>> Junwang Zhao
>>>>>>>
>>>>>>
