Re: Iceberg Consumption of ASF Shared GitHub-hosted Runners

Renjie Liu Fri, 29 May 2026 20:47:56 -0700

I like the idea of cutting the number of JVM versions exercised in each CI
run. The JVM has great backward compatibility, so we could run PR CI on one
JVM (maybe Java 17) and trigger a nightly run for Java 21.
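
As a rough illustration of that split, here is a minimal Python sketch. The
trigger names ("pull_request", "nightly") and the choice of 17 as the PR JVM
are assumptions for illustration, not Iceberg's actual workflow configuration.

```python
# Hypothetical mapping from CI trigger to the JVM versions it tests.
# Trigger names and version choices are assumptions, not Iceberg's real config.
JVMS_BY_TRIGGER = {
    "pull_request": (17,),  # one JVM per PR
    "nightly": (21,),       # the other JVM, once a day
}


def jvms_for(trigger):
    """Return the JVM versions to test for a trigger; default to the PR set."""
    return JVMS_BY_TRIGGER.get(trigger, JVMS_BY_TRIGGER["pull_request"])


def pr_job_reduction(full_matrix=(17, 21)):
    """Fraction of JVM-matrixed PR jobs avoided versus running the full matrix."""
    return 1 - len(jvms_for("pull_request")) / len(full_matrix)
```

With two JVMs in the matrix, pr_job_reduction() comes out at 0.5, i.e. half of
the JVM-matrixed jobs per PR would no longer run.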


On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]> wrote:

>
> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
> maven repo, with spark the other half.
>
> It seems to me that only PRs which go near gradle/libs.versions.toml are
> going to change dependencies, and so introduce new CVEs.
>
> There's the separate issue that "CVEs are eternal": all existing
> dependencies are collections of undiscovered/unreported CVEs. That's
> dependabot's homework, generally.
>
>
> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
>
>> Thanks everyone for the great ideas.
>>
>> Here's where we stand today with respect to ASF runner usage (taken from
>> the link [2] above):
>> GitHub Actions Build Time Used
>> - past 7 days total usage: 218,321 minutes
>> - past 5 days total usage: 120,241 minutes
>>
>> *This puts us below the hard ceiling for resource usage* as described by
>> https://infra.apache.org/github-actions-policy.html
>>
>> > The average number of minutes a project uses *per calendar week MUST
>> NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
>> 4,200 hours)*.
>> > The average number of minutes a project uses *in any consecutive
>> five-day period MUST NOT exceed the equivalent of 30 full-time runners
>> (216,000 minutes, or 3,600 hours)*.
>>
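
A quick arithmetic check of the figures quoted above, as a small Python
sketch. The usage numbers and limits are the ones in Kevin's message; the
helper name runner_equivalents is hypothetical, and "full-time runner" is
assumed to mean one runner busy 24 hours a day.

```python
MINUTES_PER_DAY = 24 * 60


def runner_equivalents(minutes_used, days):
    """Average number of runners kept busy full-time over the window."""
    return minutes_used / (days * MINUTES_PER_DAY)


# Usage and policy limits as quoted in the message (minutes).
WEEKLY_USED, WEEKLY_LIMIT = 218_321, 250_000
FIVE_DAY_USED, FIVE_DAY_LIMIT = 120_241, 216_000

weekly_runners = runner_equivalents(WEEKLY_USED, 7)      # about 21.7 runners
five_day_runners = runner_equivalents(FIVE_DAY_USED, 5)  # about 16.7 runners

weekly_headroom = WEEKLY_LIMIT - WEEKLY_USED             # 31,679 minutes
five_day_headroom = FIVE_DAY_LIMIT - FIVE_DAY_USED       # 95,759 minutes
```

On these numbers the 7-day window is the tighter constraint: usage sits at
roughly 87% of the weekly limit, versus roughly 56% of the five-day limit.
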
>> We should still make improvements wherever possible.
>>
>> I have a few PRs to reduce CI usage further.
>> - CI: Limit CVE scan runs to relevant changes #16513
>> - Build: Simplify CI workflow path filters to avoid per-workflow
>> maintenance #16302
>>
>> There are a few heuristics we can use:
>> 1. Don't run CI if not needed. For example, `site/` dir changes shouldn't
>> trigger Spark/Flink/Java CI. This might be optimized already, but we should
>> double check just in case.
>> 2. If we must run CI, fail fast. For example, if there is a formatter
>> issue, fail all inflight CI tasks.
>> 3. Within a specific CI workflow, reduce the matrix wherever possible. Do
>> we really need to run all "Java versions" x "Scala versions" x "Spark
>> versions"?
>> 4. Improve individual CI tasks. Spark CI accounts for roughly 57% of all
>> resource usage. I have a tracking issue where I benchmarked where all that
>> time is spent. See https://github.com/apache/iceberg/issues/16397
>>
>> Top CI tasks as % of resource use:
>> - Spark CI: 57.68%
>> - Flink CI: 13.60%
>> - Java CI: 7.02%
>> - CVE Scan: 3.13%
>>
>> Best,
>> Kevin Liu
>>
>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> How about implementing the incremental PR builder? (similar to
>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
>>> )
>>>
>>> I think one of the main causes of GitHub runner pressure in Iceberg is
>>> the breadth of our CI matrix. We support multiple languages (Java, Python,
>>> Go, Rust, C++) and integrations, and for Java we test across multiple JVM
>>> versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
>>> runtime bundles, and more. That coverage is valuable, but running most of
>>> it for every PR is expensive and increases both runner usage and CI wall
>>> time.
>>>
>>> I think the biggest win can be achieved by having an incremental PR
>>> build.
>>> We already have useful building blocks for it: Gradle build cache, path
>>> filters, and version-selective build properties like -DsparkVersions and
>>> -DflinkVersions.
>>>
>>> The idea is to keep full coverage on main, release branches, tags, and
>>> global build changes, but make PR CI depend on the files changed:
>>>
>>>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>>>    - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
>>>    - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
>>>    - API/Core/Data/File format changes run the owning Java checks plus
>>>    selected downstream canaries, such as latest Spark and latest Flink,
>>>    instead of the full engine matrix.
>>>    - Runtime/bundle CVE checks run only for affected runtime artifacts.
>>>    - A full-ci label or global Gradle/workflow changes can still force
>>>    the full matrix.
>>>
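
The selection rules listed above could be sketched roughly as follows in
Python. This is only an illustration under stated assumptions: the job names,
the glob patterns, and the select_jobs helper are hypothetical and are not
taken from the POC or from Iceberg's workflows.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: path glob -> CI jobs to run. All names are illustrative.
RULES = [
    ("spark/v4.1/*", {"spark-4.1"}),
    ("flink/v2.0/*", {"flink-2.0"}),
    ("api/*", {"java", "spark-latest", "flink-latest"}),
    ("core/*", {"java", "spark-latest", "flink-latest"}),
]
FULL_MATRIX = {"java", "spark-all", "flink-all", "cve-scan"}
# Assumed "global build change" paths that force the full matrix.
GLOBAL_PATTERNS = ["build.gradle", "gradle/*", ".github/workflows/*"]


def select_jobs(changed_files, labels=()):
    """Pick CI jobs for a PR from its changed files and labels."""
    if "full-ci" in labels:
        return set(FULL_MATRIX)
    jobs = set()
    for path in changed_files:
        # fnmatchcase's "*" also matches "/", so "gradle/*" covers nested files.
        if any(fnmatchcase(path, p) for p in GLOBAL_PATTERNS):
            return set(FULL_MATRIX)
        for pattern, rule_jobs in RULES:
            if fnmatchcase(path, pattern):
                jobs |= rule_jobs
    # Paths that match no rule (for example site/) add no jobs in this sketch.
    return jobs
```

For example, select_jobs(["spark/v4.1/spark/src/Foo.java"]) selects only the
Spark 4.1 job, while a change to gradle/libs.versions.toml or a full-ci label
falls back to the full matrix.
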
>>>
>>> Another possible optimization is JVM coverage. Today many PR jobs run
>>> across both Java 17 and Java 21. We could consider running one primary JVM
>>> for PRs, and reserve the full JVM matrix for main, release branches,
>>> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
>>> runner usage and PR wall time, while still preserving broad compatibility
>>> coverage before changes become part of the main branch.
>>>
>>> A practical approach could be:
>>>
>>> PRs: incremental module/version selection, mostly one JVM, plus targeted
>>> canaries.
>>> main: full matrix across JVMs, Spark versions, Flink versions, and
>>> runtime checks.
>>> Manual override: full-ci label for risky or cross-cutting PRs.
>>>
>>> This should reduce queue time, lower GitHub runner consumption, and give
>>> contributors faster feedback without giving up full coverage where it
>>> matters most.
>>>
>>> I am working on a POC https://github.com/apache/iceberg/pull/16566
>>> Suggestions are welcome.
>>>
>>> - Ajantha
>>>
>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:
>>>
>>>> Hi Manu,
>>>>
>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>>>> wrote:
>>>> >
>>>> > Hi Junwang,
>>>> >
>>>> > Not sure about others but I usually only change status to "Ready for
>>>> > review" when CI has passed.
>>>>
>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
>>>> PRs.
>>>>
>>>> Reasons to Disable:
>>>>
>>>> - Cost savings: large teams and monorepos can burn through GitHub
>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>>>> resources on code that may not even compile yet.
>>>> - Reduced noise: draft PRs are often used for experimentation or
>>>> work-in-progress changes. Disabling CI avoids cluttering the PR
>>>> timeline with transient failures while the author is still iterating.
>>>> - Better resource utilization: orgs with limited self-hosted runners
>>>> may prefer to prioritize "Ready for Review" PRs so production-relevant
>>>> changes get feedback and merge capacity sooner.
>>>>
>>>> Reasons to Keep:
>>>>
>>>> - Early error detection: developers can use draft PRs as a sandbox to
>>>> validate builds and tests before requesting review.
>>>> - Self-correction: failed checks on a draft PR allow authors to fix
>>>> lint or test issues before involving reviewers.
>>>> - Higher review confidence: by the time a PR is marked "Ready for
>>>> Review", CI has often already passed at least once, leading to a
>>>> smoother review process.
>>>>
>>>> For myself, when I create a draft PR, I'm usually sharing early
>>>> work-in-progress code with other developers and may not have tested it
>>>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
>>>> just my personal preference though.
>>>>
>>>> >
>>>> > Regards,
>>>> > Manu
>>>> >
>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
>>>> >> wrote:
>>>> >> >
>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]>
>>>> >> > wrote:
>>>> >> > >
>>>> >> > > Kevin's PR removing Spark 3.4 was merged a few days ago. It
>>>> >> > > should reduce the Spark CI cost by ~25%.
>>>> >> > >
>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>>>> >> > > extension) cause high load due to parameter combinations. I asked AI
>>>> >> > > to analyze the build log and recommend changes offering the best ROI.
>>>> >> > > Details are in this doc.
>>>> >> > >
>>>> >> > > I can look into dropping some combinations without sacrificing
>>>> >> > > essential coverage. E.g., we can probably drop the Hadoop catalog
>>>> >> > > usage in tests, as it wasn't recommended for production use anyway.
>>>> >> >
>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>>>> >> > usage a little bit. Perhaps we should apply the same approach
>>>> >> > across all iceberg subprojects?
>>>> >> >
>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>>>> >>
>>>> >> I've created a PR to demonstrate this, see [1]. Since it's a draft, CI
>>>> >> won't run. If I click the `Ready for review` button, the actions will
>>>> >> be triggered. Let me know what you think about it.
>>>> >>
>>>> >> [1] https://github.com/apache/iceberg/pull/16561
>>>> >>
>>>> >> >
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <[email protected]>
>>>> >> > > wrote:
>>>> >> > >>
>>>> >> > >> Apache DataFusion similarly received this notice. For
>>>> >> > >> visibility to the Iceberg community, we have tracking issues to try
>>>> >> > >> to discuss solutions:
>>>> >> > >>
>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>>>> >> > >>
>>>> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
>>>> >> > >> resources, and like the Iceberg project it's due to Spark tests (and
>>>> >> > >> Iceberg's Spark tests). We are doing some analysis on what subsets
>>>> >> > >> might be appropriate for our workflows, features, and goals, and will
>>>> >> > >> share anything that we think might translate back to the Iceberg CI
>>>> >> > >> workflows.
>>>> >> > >>
>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <[email protected]>
>>>> >> > >> wrote:
>>>> >> > >>>
>>>> >> > >>> Hello, Iceberg PMC.
>>>> >> > >>>
>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>>>> >> > >>> across the foundation[1]. The ASF GitHub shared pool of
>>>> >> > >>> GitHub-hosted runners has been at, or very close to, the limit of
>>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>>>> >> > >>> case again today.
>>>> >> > >>>
>>>> >> > >>> Your project has been identified as being among the top 5
>>>> >> > >>> consumers of build time over the past 7 days and we request that
>>>> >> > >>> you bring your usage down by streamlining long-running builds.
>>>> >> > >>> Contact Infra for a consultation if you are unable to streamline
>>>> >> > >>> your builds further.
>>>> >> > >>>
>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage
>>>> >> > >>> as you work on streamlining, as well as locate any bottlenecks in
>>>> >> > >>> the workflows.
>>>> >> > >>>
>>>> >> > >>> Infra will allow you two weeks' time (till the 8th of June, 2026) to
>>>> >> > >>> progress this, but should you still be above the limits by then,
>>>> >> > >>> without a viable path forward, we will be limiting your GHA usage.
>>>> >> > >>>
>>>> >> > >>> Kind regards,
>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>>>> >> > >>>
>>>> >> > >>>
>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>>>> >> > >>>
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Regards
>>>> >> > Junwang Zhao
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Regards
>>>> >> Junwang Zhao
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>> Junwang Zhao
>>>>
>>>
