Thanks everyone for the great ideas.

Here's where we stand today with respect to ASF runner usage (taken from
the link [2] above):
GitHub Actions Build Time Used
- past 7 days total usage: 218,321 minutes
- past 5 days total usage: 120,241 minutes

*This puts us below the hard ceiling for resource usage* as described by
https://infra.apache.org/github-actions-policy.html

> The average number of minutes a project uses *per calendar week MUST NOT
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200
hours)*.
> The average number of minutes a project uses *in any consecutive five-day
period MUST NOT exceed the equivalent of 30 full-time runners (216,000
minutes, or 3,600 hours)*.

We should still make improvements wherever possible.
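
As a quick sanity check on the figures above, here is a minimal Python sketch that compares the reported usage with the two policy ceilings. It assumes the tool's rolling "past 7 days" and "past 5 days" totals are a fair proxy for the calendar-week and five-day averages the policy describes; the `headroom` helper is only an illustrative name.

```python
# Ceilings as quoted from the ASF GitHub Actions policy (minutes).
WEEKLY_CEILING = 250_000    # per calendar week, 25 full-time runners
FIVE_DAY_CEILING = 216_000  # any consecutive five days, 30 full-time runners

# Usage figures reported by the infra reporting tool (minutes).
USAGE_7_DAYS = 218_321
USAGE_5_DAYS = 120_241


def headroom(used: int, ceiling: int) -> tuple[int, float]:
    """Return (minutes remaining under the ceiling, percent of ceiling used)."""
    return ceiling - used, round(100 * used / ceiling, 1)


weekly = headroom(USAGE_7_DAYS, WEEKLY_CEILING)
five_day = headroom(USAGE_5_DAYS, FIVE_DAY_CEILING)

print(f"7-day: {weekly[0]:,} minutes of headroom, {weekly[1]}% of ceiling used")
print(f"5-day: {five_day[0]:,} minutes of headroom, {five_day[1]}% of ceiling used")
```

Under those assumptions the weekly figure sits at roughly 87% of its ceiling, so the margin is real but not large.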

I have a couple of PRs to reduce CI usage further:
- CI: Limit CVE scan runs to relevant changes #16513
- Build: Simplify CI workflow path filters to avoid per-workflow
maintenance #16302

There are a few heuristics we can use:
1. Don't run CI if not needed. For example, `site/` dir changes shouldn't
trigger Spark/Flink/Java CI. This might be optimized already, but we should
double check just in case.
2. If we must run CI, fail fast. For example, if there is a formatter
issue, cancel all in-flight CI tasks.
3. Within a specific CI workflow, reduce the matrix wherever possible. Do
we really need to run all "Java versions" x "Scala versions" x "Spark
versions"?
4. Improve individual CI tasks. Spark CI alone accounts for nearly 58% of
all resource usage. I opened a tracking issue with benchmarks showing where
that time is spent. See https://github.com/apache/iceberg/issues/16397
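
To make heuristic 1 concrete, here is a minimal Python sketch of path-based workflow selection. The directory-to-workflow mapping and the workflow names (`spark-ci`, `flink-ci`, `java-ci`) are hypothetical and are not taken from the actual Iceberg workflow files; in practice this logic would normally be expressed through the workflows' own path filters.

```python
from typing import Iterable

# Hypothetical mapping from a top-level directory to the CI workflows it affects.
# An empty set means no engine CI is needed (for example, docs-only changes).
WORKFLOWS_BY_PREFIX: dict[str, set[str]] = {
    "site/": set(),
    "spark/": {"spark-ci"},
    "flink/": {"flink-ci"},
}

# Hypothetical full set of workflows, used as the conservative fallback.
ALL_WORKFLOWS: set[str] = {"spark-ci", "flink-ci", "java-ci"}


def workflows_for(changed_paths: Iterable[str]) -> set[str]:
    """Return the set of workflows to run for a list of changed file paths."""
    selected: set[str] = set()
    for path in changed_paths:
        for prefix, workflows in WORKFLOWS_BY_PREFIX.items():
            if path.startswith(prefix):
                selected |= workflows
                break
        else:
            # Path is not covered by any known prefix: run everything.
            return set(ALL_WORKFLOWS)
    return selected
```

For example, `workflows_for(["site/docs/index.md"])` returns an empty set, while any path outside the known prefixes falls back to the full set, so shared modules still get complete coverage.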

Top CI tasks as % of resource use:
- Spark CI: 57.68%
- Flink CI: 13.60%
- Java CI: 7.02%
- CVE Scan: 3.13%

Best,
Kevin Liu

On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]> wrote:

> Hi all,
>
> How about implementing the incremental PR builder? (similar to
> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
> )
>
> I think one of the main causes of GitHub runner pressure in Iceberg is the
> breadth of our CI matrix. We support multiple languages (java, python, go,
> rust, cpp) and integrations, and for Java we test across multiple JVM
> versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
> runtime bundles, and more. That coverage is valuable, but running most of
> it for every PR is expensive and increases both runner usage and CI wall
> time.
>
> I think the biggest win can be achieved by having an incremental PR build.
> We already have useful building blocks for it: Gradle build cache, path
> filters, and version-selective build properties like -DsparkVersions and
> -DflinkVersions.
>
> The idea is to keep full coverage on main, release branches, tags, and
> global build changes, but make PR CI depend on the files changed:
>
>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>    - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
>    - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
>    - API/Core/Data/File format changes run the owning Java checks plus
>    selected downstream canaries, such as latest Spark and latest Flink,
>    instead of the full engine matrix.
>    - Runtime/bundle CVE checks run only for affected runtime artifacts.
>    - A full-ci label or global Gradle/workflow changes can still force
>    the full matrix.
>
>
> Another possible optimization is JVM coverage. Today many PR jobs run
> across both Java 17 and Java 21. We could consider running one primary JVM
> for PRs, and reserve the full JVM matrix for main, release branches,
> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
> runner usage and PR wall time, while still preserving broad compatibility
> coverage before changes become part of the main branch.
>
> A practical approach could be:
>
> PRs: incremental module/version selection, mostly one JVM, plus targeted
> canaries.
> main: full matrix across JVMs, Spark versions, Flink versions, and runtime
> checks.
> Manual override: full-ci label for risky or cross-cutting PRs.
>
> This should reduce queue time, lower GitHub runner consumption, and give
> contributors faster feedback without giving up full coverage where it
> matters most.
>
> I am working on a POC https://github.com/apache/iceberg/pull/16566
> Suggestions are welcome.
>
> - Ajantha
>
> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:
>
>> Hi Manu,
>>
>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>> wrote:
>> >
>> > Hi Junwang,
>> >
>> > Not sure about others but I usually only change status to "Ready for
>> review"  when CI has passed.
>>
>> Yeah, I agree there are trade-offs to disabling gh actions for draft PRs.
>>
>> Reasons to Disable:
>>
>> - Cost savings: large teams and monorepos can burn through GitHub
>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>> resources on code that may not even compile yet.
>> - Reduced noise: draft PRs are often used for experimentation or
>> work-in-progress changes. Disabling CI avoids cluttering the PR
>> timeline with transient failures while the author is still iterating.
>> - Better resource utilization: orgs with limited self-hosted runners
>> may prefer to prioritize "Ready for Review" PRs so production-relevant
>> changes get feedback and merge capacity sooner.
>>
>> Reasons to Keep:
>>
>> - Early error detection: developers can use draft PRs as a sandbox to
>> validate builds and tests before requesting review.
>> - Self-correction: failed checks on a draft PR allow authors to fix
>> lint or test issues before involving reviewers.
>> - Higher review confidence: by the time a PR is marked "Ready for
>> Review", CI has often already passed at least once, leading to a
>> smoother review process.
>>
>> For myself, when I create a draft PR, I'm usually sharing early
>> work-in-progress code with other developers and may not have tested it
>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
>> just my personal preference though.
>>
>> >
>> > Regards,
>> > Manu
>> >
>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]> wrote:
>> >>
>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
>> wrote:
>> >> >
>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]>
>> wrote:
>> >> > >
>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days ago. It
>> should reduce the Spark CI cost by ~25%.
>> >> > >
>> >> > > Some heavy-hitter test classes in Spark tests (core and extension)
>> cause high load due to parameter combinations. I asked AI to analyze the
>> build log and recommend changes offering the best ROI. Details are in this
>> doc.
>> >> > >
>> >> > > I can look into dropping some combinations without sacrificing
>> essential coverage. E.g., we can probably drop the Hadoop catalog usage in
>> test, as it wasn't recommended for production use anyway.
>> >> >
>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>> >> > usage a little bit. Perhaps we should apply the same approach across
>> >> > all iceberg subprojects?
>> >> >
>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>> >>
>> >> I've created a PR to demonstrate this, see [1]. Since it's a draft, CI
>> >> won't run. If I click the `Ready for review` button, the actions will
>> >> be triggered. Let me know what you think about it.
>> >>
>> >> [1] https://github.com/apache/iceberg/pull/16561
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <
>> [email protected]> wrote:
>> >> > >>
>> >> > >> Apache DataFusion similarly received this notice. For visibility
>> to the Iceberg community, we have tracking issues to try to discuss
>> solutions:
>> >> > >>
>> >> > >> https://github.com/apache/datafusion/issues/22455
>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>> >> > >>
>> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
>> resources, and like the Iceberg project it's due to Spark tests (and
>> Iceberg's Spark tests). We are doing some analysis on what subsets might be
>> appropriate for our workflows, features, and goals, and will share anything
>> that we think might translate back to the Iceberg CI workflows.
>> >> > >>
>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <
>> [email protected]> wrote:
>> >> > >>>
>> >> > >>> Hello, Iceberg PMC.
>> >> > >>>
>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>> >> > >>> across the foundation[1]. The ASF GitHub shared pool of
>> >> > >>> GitHub-hosted runners has been at, or very close to the limit of
>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>> >> > >>> case again today.
>> >> > >>>
>> >> > >>> Your project has been identified as being among the top 5
>> consumers of
>> >> > >>> build time over the past 7 days and we request that you bring
>> your
>> >> > >>> usage down by stream-lining long-running builds. Contact Infra
>> for
>> >> > >>> a consultation if you are unable to streamline your builds
>> further.
>> >> > >>>
>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA
>> usage as you
>> >> > >>> work on stream-lining, as well as locate any bottlenecks in the
>> workflows.
>> >> > >>>
>> >> > >>> Infra will allow you two weeks time (till the 8th of June, 2026)
>> to
>> >> > >>> progress this, but should you still be above the limits by then,
>> >> > >>> without a viable path forward, we will be limiting your GHA
>> usage.
>> >> > >>>
>> >> > >>> Kind regards,
>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>> >> > >>>
>> >> > >>>
>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>> >> > >>> [2]
>> https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>> >> > >>>
>> >> >
>> >> >
>> >> > --
>> >> > Regards
>> >> > Junwang Zhao
>> >>
>> >>
>> >>
>> >> --
>> >> Regards
>> >> Junwang Zhao
>>
>>
>>
>> --
>> Regards
>> Junwang Zhao
>>
>
