Hi Manu, all of these were handled in the parent PR I mentioned three weeks
ago.
Can we all please review this? https://github.com/apache/iceberg/pull/16566

I can split into smaller PRs if required.

On Thu, Jun 18, 2026 at 1:59 PM Manu Zhang <[email protected]> wrote:

> Hi all,
>
> Here's another quick win from scoping Spark CI to only changed Spark
> versions [1]. We usually open a PR first against the latest Spark version
> and then back-port it to previous versions after the merge. Running Spark
> CI for all Spark versions in such cases wastes resources.
>
> If this approach is approved, I can also make a PR for Flink CI.
>
>
> 1. https://github.com/apache/iceberg/pull/16800
>
> Thanks,
> Manu
>
> On Sat, Jun 13, 2026 at 8:34 AM Abnob Doss <[email protected]> wrote:
>
>> Hi,
>>
>> A potential small win from the subproject side: the iceberg-rust Python
>> bindings CI had ended up building the Rust bindings twice per run, due to
>> an accidental interaction between a few changes over time. One-line fix:
>> https://github.com/apache/iceberg-rust/pull/2636
>>
>> Measured over the past 7 days, the duplicate build took a median of 8.4
>> min on Linux, 12.1 min on macOS, and 15.3 min on Windows, totaling about
>> 2,400 runner-minutes across 207 job executions. After the fix the same step
>> takes a few seconds.
>>
>> Thanks,
>> Abanoub
>>
>> On Wednesday, June 3rd, 2026 at 9:49 AM, Bob Thomson <[email protected]>
>> wrote:
>>
>> > I don't think we have data to that level of granularity; it's a case of
>> looking at the Actions and their run time and frequency of execution in
>> each of your repos, and focussing on the longest running and most frequent
>> ones. That is, an Action run might only run for 5 minutes each time, but if
>> it is running 400 times a day then that occupies more than one job slot of
>> the total of 900 ASF has, for the duration of that day.
>> > Experience so far suggests those actions that build Java are often the
>> most time consuming.
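
As a rough check of the slot arithmetic above, the calculation can be sketched in a few lines. The function name `slots_occupied` is made up for illustration; the figures (5-minute runs, 400 runs per day) are the ones quoted in the message.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes of capacity per job slot per day


def slots_occupied(run_minutes: float, runs_per_day: int) -> float:
    """Average number of job slots a workflow keeps busy over one day."""
    return run_minutes * runs_per_day / MINUTES_PER_DAY


usage = slots_occupied(run_minutes=5, runs_per_day=400)
print(f"{usage:.2f} slots")  # 2,000 runner-minutes / 1,440 = about 1.39 slots
```

So a short but frequent Action can cost more than a long but rare one, which is why run time and frequency need to be looked at together.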
>> >
>> > Thanks.
>> >
>> > Kind regards,
>> > -Bob Thomson.
>> >
>> > On 2026/06/01 18:39:38 Yufei Gu wrote:
>> > > Hi Bob,
>> > >
>> > > Thanks for the heads-up and for giving the Iceberg community time to
>> work
>> > > on this.
>> > >
>> > > One question: Is the concern based on the overall GitHub Actions
>> > > consumption of the Iceberg projects (e.g., main repo, python repo, go
>> repo,
>> > > etc), or only for the main Iceberg repository? Iceberg has multiple
>> > > repositories, including the main repository as well as Python, Go,
>> Rust,
>> > > and C++ subprojects. Most of the discussion and optimization work in
>> this
>> > > thread focuses on the main repository, where the majority of CI usage
>> > > occurs. If the overall project usage is within acceptable limits,
>> would it
>> > > be possible to allow a higher quota for a single repo (the Iceberg
>> main
>> > > repository), given its broader compatibility and integration testing
>> > > requirements?
>> > >
>> > > Yufei
>> > >
>> > >
>> > > On Mon, Jun 1, 2026 at 11:00 AM Steve Loughran <[email protected]>
>> wrote:
>> > >
>> > > > This is really good for draft builds.
>> > > >
>> > > > If I'm committing and pushing work up to a WiP PR, it is often
>> because I
>> > > > want *a* machine to do the testing; I don't care who it runs as.
>> > > >
>> > > > Forcing PRs to run as the submitter also hardens the OSS repo
>> against
>> > > > vulnerabilities in the Github Actions and other parts of the build
>> process.
>> > > >
>> > > > On Mon, 1 Jun 2026 at 17:11, Prashant Singh <
>> [email protected]>
>> > > > wrote:
>> > > >
>> > > >>   Hi all,
>> > > >>
>> > > >>   Great progress on the matrix reduction, incremental builds, and
>> draft PR
>> > > >>   skipping ideas. I'd like to propose a complementary approach
>> that can
>> > > >> work
>> > > >>   alongside all of those: running PR CI on contributor fork compute
>> > > >> instead
>> > > >>   of the ASF shared pool.
>> > > >>
>> > > >>   How it works:
>> > > >>
>> > > >>   Workflows switch from pull_request to push triggers on non-main
>> > > >>   branches. Each workflow:
>> > > >>
>> > > >>   1. Checks out apache/iceberg main (security boundary — untrusted
>> code
>> > > >>   can't modify the workflow itself)
>> > > >>   2. Squash-merges the contributor's fork branch on top
>> > > >>   3. Runs tests on that merged tree
>> > > >>
>> > > >>   Because the push event fires on the fork, GitHub bills the CI
>> minutes
>> > > >>   to the fork owner's account - not the ASF shared pool. This takes
>> > > >>   Iceberg's PR CI usage from the ASF runners to effectively zero,
>> > > >>   regardless of matrix size.
>> > > >>
>> > > >>   Why this is complementary:
>> > > >>
>> > > >>   The optimizations discussed so far all reduce how much CI runs.
>> > > >> Fork-compute changes where
>> > > >>   it runs. They compose - a leaner matrix running on fork compute
>> is
>> > > >>   strictly better than either approach alone.
>> > > >>
>> > > >>   Inline PR status:
>> > > >>
>> > > >>   A lightweight notify_test_workflow.yml (using
>> pull_request_target +
>> > > >>   Checks API) is included to post fork CI results directly onto the
>> > > >>   upstream PR's checks tab - so reviewers see green/red status
>> inline as
>> > > >>   they do today.
>> > > >>
>> > > >>   *Prior art*:
>> > > >>
>> > > >>   Apache Spark adopted this pattern in 2024 (SPARK-47041) and has
>> been
>> > > >>   running it in production since. Their full Spark CI matrix runs
>> entirely
>> > > >>   on contributor forks.
>> > > >>
>> > > >>   PR: https://github.com/apache/iceberg/pull/15397: covers all 10
>> > > >>   workflow files. I've verified all workflows pass on fork
>> computation.
>> > > >>
>> > > >>   This could be merged independently of the matrix/incremental
>> > > >>   optimizations and would immediately eliminate PR CI pressure on
>> the
>> > > >>   ASF pool - well within the June 8 deadline.
>> > > >>
>> > > >>   Thoughts?
>> > > >>
>> > > >> Prashant Singh
>> > > >>
>> > > >> On Fri, May 29, 2026 at 8:47 PM Renjie Liu <
>> [email protected]>
>> > > >> wrote:
>> > > >>
>> > > >>> I like the idea of cutting supported jvm runs in each ci. JVM has
>> great
>> > > >>> backward compatibility, so we could run on one jvm (maybe jvm 17) and
>> trigger a
>> > > >>> nightly run for jvm 21.
>> > > >>>
>> > > >>> On Wed, May 27, 2026 at 3:17 AM Steve Loughran <
>> [email protected]>
>> > > >>> wrote:
>> > > >>>
>> > > >>>>
>> > > >>>> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of
>> the
>> > > >>>> maven repo, with spark the other half.
>> > > >>>>
>> > > >>>> It seems to me that only PRs which go near
>> gradle/libs.versions.toml
>> > > >>>> are going to change dependencies, and so introduce new CVEs.
>> > > >>>>
>> > > >>>> There's the separate issue "CVEs are eternal" and all existing
>> > > >>>> dependencies are collections of undiscovered/unreported cves.
>> That's
>> > > >>>> dependabot's homework, generally.
>> > > >>>>
>> > > >>>>
>> > > >>>> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]>
>> wrote:
>> > > >>>>
>> > > >>>>> Thanks everyone for the great ideas.
>> > > >>>>>
>> > > >>>>> Here's where we stand today with respect to ASF runner usage
>> (taken
>> > > >>>>> from the link [2] above):
>> > > >>>>> GitHub Actions Build Time Used
>> > > >>>>> - past 7 days total usage: 218,321 minutes
>> > > >>>>> - past 5 days total usage: 120,241 minutes
>> > > >>>>>
>> > > >>>>> *This puts us below the hard ceiling for resource usage* as
>> described
>> > > >>>>> by https://infra.apache.org/github-actions-policy.html
>> > > >>>>>
>> > > >>>>> > The average number of minutes a project uses *per calendar
>> week
>> > > >>>>> MUST NOT exceed the equivalent of 25 full-time runners (250,000
>> minutes, or
>> > > >>>>> 4,200 hours)*.
>> > > >>>>> > The average number of minutes a project uses *in any
>> consecutive
>> > > >>>>> five-day period MUST NOT exceed the equivalent of 30 full-time
>> runners
>> > > >>>>> (216,000 minutes, or 3,600 hours)*.
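
The comparison against the two policy ceilings can be sketched as follows. This is only an illustrative check, not part of the infra reporting tool: the helper `runner_equivalents` is a hypothetical name, and the usage figures and ceilings are the ones quoted above. Note that 25 full-time runners over seven days works out to 252,000 minutes (4,200 hours), which the policy text appears to round to 250,000.

```python
MINUTES_PER_DAY = 24 * 60


def runner_equivalents(total_minutes: int, days: int) -> float:
    """Average number of full-time runners implied by usage over `days` days."""
    return total_minutes / (days * MINUTES_PER_DAY)


# (window name, minutes used, window length in days, ceiling in full-time runners)
windows = [
    ("7-day", 218_321, 7, 25),
    ("5-day", 120_241, 5, 30),
]

for name, used, days, ceiling in windows:
    runners = runner_equivalents(used, days)
    status = "below" if runners < ceiling else "at or above"
    print(f"{name}: {runners:.1f} runners, {status} the ceiling of {ceiling}")
```

With the quoted figures this gives roughly 21.7 runners for the 7-day window and 16.7 for the 5-day window, so the weekly ceiling is the tighter of the two.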
>> > > >>>>>
>> > > >>>>> We should still make improvements wherever possible.
>> > > >>>>>
>> > > >>>>> I have a few PRs to reduce CI usage further.
>> > > >>>>> - CI: Limit CVE scan runs to relevant changes #16513
>> > > >>>>> - Build: Simplify CI workflow path filters to avoid per-workflow
>> > > >>>>> maintenance #16302
>> > > >>>>>
>> > > >>>>> There are a couple of heuristics we can use
>> > > >>>>> 1. Don't run CI if not needed. For example, `site/` dir changes
>> > > >>>>> shouldn't trigger Spark/Flink/Java CI. This might be optimized
>> already, but
>> > > >>>>> we should double check just in case.
>> > > >>>>> 2. If we must run CI, fail fast. For example, if there is a
>> formatter
>> > > >>>>> issue, fail all inflight CI tasks.
>> > > >>>>> 3. Within a specific CI workflow, reduce the matrix wherever
>> possible.
>> > > >>>>> Do we really need to run all "Java versions" x "Scala versions"
>> x "Spark
>> > > >>>>> versions"?
>> > > >>>>> 4. Improve individual CI tasks. Spark CI dominates 57% of all
>> resource
>> > > >>>>> usage. I have a tracking issue where I benchmarked where all
>> that time is
>> > > >>>>> spent. See https://github.com/apache/iceberg/issues/16397
>> > > >>>>>
>> > > >>>>> Top CI tasks as % of resource use:
>> > > >>>>> - Spark CI: 57.68%
>> > > >>>>> - Flink CI: 13.60%
>> > > >>>>> - Java CI: 7.02%
>> > > >>>>> - CVE Scan: 3.13%
>> > > >>>>>
>> > > >>>>> Best,
>> > > >>>>> Kevin Liu
>> > > >>>>>
>> > > >>>>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <
>> [email protected]>
>> > > >>>>> wrote:
>> > > >>>>>
>> > > >>>>>> Hi all,
>> > > >>>>>>
>> > > >>>>>> How about implementing the incremental PR builder? (similar to
>> > > >>>>>>
>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
>> > > >>>>>> )
>> > > >>>>>>
>> > > >>>>>> I think one of the main causes of GitHub runner pressure in
>> Iceberg
>> > > >>>>>> is the breadth of our CI matrix. We support multiple languages
>> (java,
>> > > >>>>>> python, go, rust, cpp) and integrations, and for Java we test
>> across
>> > > >>>>>> multiple JVM versions, Spark versions, Flink versions, Kafka,
>> Hive/MR,
>> > > >>>>>> REST/OpenAPI, runtime bundles, and more. That coverage is
>> valuable, but
>> > > >>>>>> running most of it for every PR is expensive and increases
>> both runner
>> > > >>>>>> usage and CI wall time.
>> > > >>>>>>
>> > > >>>>>> I think the biggest win can be achieved by having an
>> incremental PR
>> > > >>>>>> build.
>> > > >>>>>> We already have useful building blocks for it: Gradle build
>> cache,
>> > > >>>>>> path filters, and version-selective build properties like
>> -DsparkVersions
>> > > >>>>>> and -DflinkVersions.
>> > > >>>>>>
>> > > >>>>>> The idea is to keep full coverage on main, release branches,
>> tags,
>> > > >>>>>> and global build changes, but make PR CI depend on the files
>> changed:
>> > > >>>>>>
>> > > >>>>>>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>> > > >>>>>>    - spark/v4.1/** changes run only Spark 4.1, not every Spark
>> > > >>>>>>    version.
>> > > >>>>>>    - flink/v2.0/** changes run only Flink 2.0, not every Flink
>> > > >>>>>>    version.
>> > > >>>>>>    - API/Core/Data/File format changes run the owning Java
>> checks
>> > > >>>>>>    plus selected downstream canaries, such as latest Spark and
>> latest Flink,
>> > > >>>>>>    instead of the full engine matrix.
>> > > >>>>>>    - Runtime/bundle CVE checks run only for affected runtime
>> > > >>>>>>    artifacts.
>> > > >>>>>>    - A full-ci label or global Gradle/workflow changes can
>> still
>> > > >>>>>>    force the full matrix.
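
The selection rules in the list above can be sketched as a small function that maps changed paths to CI jobs. This is a minimal illustration under stated assumptions, not the logic of the POC PR: the job names, the module prefixes in `CORE_MODULES`, the paths in `GLOBAL_PATHS`, and the "latest" versions are all hypothetical placeholders.

```python
import re

# Hypothetical values: the real version lists and module layout live in the
# Iceberg build, not in this sketch.
LATEST_SPARK = "4.1"
LATEST_FLINK = "2.0"
CORE_MODULES = ("api/", "core/", "data/", "parquet/", "orc/")
GLOBAL_PATHS = ("build.gradle", "settings.gradle", "gradle/", ".github/workflows/")

# Matches engine version directories such as spark/v4.1/ or flink/v2.0/.
ENGINE_DIR = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")


def select_jobs(changed_files, labels=()):
    """Return the set of CI jobs to run for a PR, or {"full-matrix"}."""
    if "full-ci" in labels:
        return {"full-matrix"}
    jobs = set()
    for path in changed_files:
        if path.startswith(GLOBAL_PATHS):
            # Global build or workflow changes force the full matrix.
            return {"full-matrix"}
        match = ENGINE_DIR.match(path)
        if match:
            engine, version = match.groups()
            jobs.add(f"{engine}-{version}")
        elif path.startswith(CORE_MODULES):
            # Owning Java checks plus downstream canaries on the latest engines.
            jobs.update({"java", f"spark-{LATEST_SPARK}", f"flink-{LATEST_FLINK}"})
    return jobs


print(sorted(select_jobs(["spark/v4.1/spark/src/main/java/Foo.java"])))
# ['spark-4.1']
print(sorted(select_jobs(["core/src/main/java/Bar.java"])))
# ['flink-2.0', 'java', 'spark-4.1']
print(sorted(select_jobs(["site/docs/index.md"], labels=["full-ci"])))
# ['full-matrix']
```

Paths that match none of the rules (for example `site/`) select no jobs in this sketch; a real implementation would need an explicit decision for such paths.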
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>> Another possible optimization is JVM coverage. Today many PR
>> jobs run
>> > > >>>>>> across both Java 17 and Java 21. We could consider running one
>> primary JVM
>> > > >>>>>> for PRs, and reserve the full JVM matrix for main, release
>> branches,
>> > > >>>>>> nightly/scheduled builds, or PRs labeled full-ci. That would
>> further reduce
>> > > >>>>>> runner usage and PR wall time, while still preserving broad
>> compatibility
>> > > >>>>>> coverage before changes become part of the main branch.
>> > > >>>>>>
>> > > >>>>>> A practical approach could be:
>> > > >>>>>>
>> > > >>>>>> PRs: incremental module/version selection, mostly one JVM, plus
>> > > >>>>>> targeted canaries.
>> > > >>>>>> main: full matrix across JVMs, Spark versions, Flink versions,
>> and
>> > > >>>>>> runtime checks.
>> > > >>>>>> Manual override: full-ci label for risky or cross-cutting PRs.
>> > > >>>>>>
>> > > >>>>>> This should reduce queue time, lower GitHub runner
>> consumption, and
>> > > >>>>>> give contributors faster feedback without giving up full
>> coverage where it
>> > > >>>>>> matters most.
>> > > >>>>>>
>> > > >>>>>> I am working on a POC
>> https://github.com/apache/iceberg/pull/16566
>> > > >>>>>> Suggestions are welcome.
>> > > >>>>>>
>> > > >>>>>> - Ajantha
>> > > >>>>>>
>> > > >>>>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <
>> [email protected]>
>> > > >>>>>> wrote:
>> > > >>>>>>
>> > > >>>>>>> Hi Manu,
>> > > >>>>>>>
>> > > >>>>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <
>> [email protected]>
>> > > >>>>>>> wrote:
>> > > >>>>>>> >
>> > > >>>>>>> > Hi Junwang,
>> > > >>>>>>> >
>> > > >>>>>>> > Not sure about others but I usually only change status to
>> "Ready
>> > > >>>>>>> for review" when CI has passed.
>> > > >>>>>>>
>> > > >>>>>>> Yeah, I agree there are trade-offs to disabling gh actions
>> for draft
>> > > >>>>>>> PRs.
>> > > >>>>>>>
>> > > >>>>>>> Reasons to Disable:
>> > > >>>>>>>
>> > > >>>>>>> - Cost savings: large teams and monorepos can burn through
>> GitHub
>> > > >>>>>>> Actions minutes quickly. Skipping CI for draft PRs avoids
>> spending
>> > > >>>>>>> resources on code that may not even compile yet.
>> > > >>>>>>> - Reduced noise: draft PRs are often used for experimentation
>> or
>> > > >>>>>>> work-in-progress changes. Disabling CI avoids cluttering the
>> PR
>> > > >>>>>>> timeline with transient failures while the author is still
>> iterating.
>> > > >>>>>>> - Better resource utilization: orgs with limited self-hosted
>> runners
>> > > >>>>>>> may prefer to prioritize "Ready for Review" PRs so
>> > > >>>>>>> production-relevant
>> > > >>>>>>> changes get feedback and merge capacity sooner.
>> > > >>>>>>>
>> > > >>>>>>> Reasons to Keep:
>> > > >>>>>>>
>> > > >>>>>>> - Early error detection: developers can use draft PRs as a
>> sandbox to
>> > > >>>>>>> validate builds and tests before requesting review.
>> > > >>>>>>> - Self-correction: failed checks on a draft PR allow authors
>> to fix
>> > > >>>>>>> lint or test issues before involving reviewers.
>> > > >>>>>>> - Higher review confidence: by the time a PR is marked "Ready
>> for
>> > > >>>>>>> Review", CI has often already passed at least once, leading
>> to a
>> > > >>>>>>> smoother review process.
>> > > >>>>>>>
>> > > >>>>>>> For myself, when I create a draft PR, I'm usually sharing
>> early
>> > > >>>>>>> work-in-progress code with other developers and may not have
>> tested
>> > > >>>>>>> it
>> > > >>>>>>> thoroughly locally yet, so I sometimes prefer to disable CI.
>> That's
>> > > >>>>>>> just my personal preference though.
>> > > >>>>>>>
>> > > >>>>>>> >
>> > > >>>>>>> > Regards,
>> > > >>>>>>> > Manu
>> > > >>>>>>> >
>> > > >>>>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <
>> [email protected]>
>> > > >>>>>>> wrote:
>> > > >>>>>>> >>
>> > > >>>>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <
>> [email protected]>
>> > > >>>>>>> wrote:
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <
>> > > >>>>>>> [email protected]> wrote:
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days
>> ago.
>> > > >>>>>>> It should reduce the Spark CI cost by ~25%.
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>> > > >>>>>>> extension) cause high load due to parameter combinations. I
>> asked AI to
>> > > >>>>>>> analyze the build log and recommend changes offering the best
>> ROI. Details
>> > > >>>>>>> are in this doc.
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > I can look into dropping some combinations without
>> > > >>>>>>> sacrificing essential coverage. E.g., we can probably drop
>> the Hadoop
>> > > >>>>>>> catalog usage in test, as it wasn't recommended for
>> production use anyway.
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI
>> > > >>>>>>> resource
>> > > >>>>>>> >> > usage a little bit. Perhaps we should apply the same
>> approach
>> > > >>>>>>> across
>> > > >>>>>>> >> > all iceberg subprojects?
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>> > > >>>>>>> >>
>> > > >>>>>>> >> I've created a PR to show that, see [1], since it's a
>> draft, the
>> > > >>>>>>> CI
>> > > >>>>>>> >> won't run. If I click the `Ready for review` button, the
>> actions
>> > > >>>>>>> will
>> > > >>>>>>> >> be triggered. Let me know what you think about it.
>> > > >>>>>>> >>
>> > > >>>>>>> >> [1] https://github.com/apache/iceberg/pull/16561
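
The draft-skipping behaviour described above can be sketched as a predicate over the GitHub `pull_request` event payload. This is an illustration only and not necessarily how the linked PR implements it (a workflow-level `if:` condition is the more likely form); the function name `should_run_ci` is hypothetical, while `action` and `pull_request.draft` are fields of the event payload.

```python
def should_run_ci(event: dict) -> bool:
    """Skip CI while a pull request is a draft; run it otherwise."""
    pull_request = event.get("pull_request", {})
    return not pull_request.get("draft", False)


draft_opened = {"action": "opened", "pull_request": {"draft": True}}
marked_ready = {"action": "ready_for_review", "pull_request": {"draft": False}}

print(should_run_ci(draft_opened))  # False
print(should_run_ci(marked_ready))  # True
```

For the "Ready for review" button to start CI, the workflow also has to subscribe to the `ready_for_review` activity type, since it is not among the default `pull_request` types.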
>> > > >>>>>>> >>
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > >
>> > > >>>>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <
>> > > >>>>>>> [email protected]> wrote:
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> Apache DataFusion similarly received this notice. For
>> > > >>>>>>> visibility to the Iceberg community, we have tracking issues
>> to try to
>> > > >>>>>>> discuss solutions:
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>> > > >>>>>>> >> > >>
>> https://github.com/apache/datafusion-comet/issues/4406
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> DataFusion Comet is consuming the vast majority of
>> > > >>>>>>> DataFusion resources, and like the Iceberg project it's due
>> to Spark tests
>> > > >>>>>>> (and Iceberg's Spark tests). We are doing some analysis on
>> what subsets
>> > > >>>>>>> might be appropriate for our workflows, features, and goals,
>> and will share
>> > > >>>>>>> anything that we think might translate back to the Iceberg CI
>> workflows.
>> > > >>>>>>> >> > >>
>> > > >>>>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <
>> > > >>>>>>> [email protected]> wrote:
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Hello, Iceberg PMC.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub
>> Actions
>> > > >>>>>>> usage
>> > > >>>>>>> >> > >>> across the foundation[1]. The ASF Github shared pool
>> of
>> > > >>>>>>> >> > >>> Github-hosted runners has been at, or very close to
>> the
>> > > >>>>>>> limit of
>> > > >>>>>>> >> > >>> 900 jobs most of the time in the past few weeks and
>> this is
>> > > >>>>>>> the
>> > > >>>>>>> >> > >>> case again today.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Your project has been identified as being among the
>> top 5
>> > > >>>>>>> consumers of
>> > > >>>>>>> >> > >>> build time over the past 7 days and we request that
>> you
>> > > >>>>>>> bring your
>> > > >>>>>>> >> > >>> usage down by stream-lining long-running builds.
>> Contact
>> > > >>>>>>> Infra for
>> > > >>>>>>> >> > >>> a consultation if you are unable to streamline your
>> builds
>> > > >>>>>>> further.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> You can use the infra reporting tool[2] to monitor
>> your GHA
>> > > >>>>>>> usage as you
>> > > >>>>>>> >> > >>> work on stream-lining, as well as locate any
>> bottlenecks in
>> > > >>>>>>> the workflows.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Infra will allow you two weeks' time (till the 8th of
>> June,
>> > > >>>>>>> 2026) to
>> > > >>>>>>> >> > >>> progress this, but should you still be above the
>> limits by
>> > > >>>>>>> then,
>> > > >>>>>>> >> > >>> without a viable path forward, we will be limiting
>> your GHA
>> > > >>>>>>> usage.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> Kind regards,
>> > > >>>>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> > >>> [1]
>> https://infra.apache.org/github-actions-policy.html
>> > > >>>>>>> >> > >>> [2]
>> > > >>>>>>>
>> https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>> > > >>>>>>> >> > >>>
>> > > >>>>>>> >> >
>> > > >>>>>>> >> >
>> > > >>>>>>> >> > --
>> > > >>>>>>> >> > Regards
>> > > >>>>>>> >> > Junwang Zhao
>> > > >>>>>>> >>
>> > > >>>>>>> >>
>> > > >>>>>>> >>
>> > > >>>>>>> >> --
>> > > >>>>>>> >> Regards
>> > > >>>>>>> >> Junwang Zhao
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>> --
>> > > >>>>>>> Regards
>> > > >>>>>>> Junwang Zhao
>> > > >>>>>>>
>> > > >>>>>>
>> > >
>> >
>>
>
