I like the idea of cutting the number of JVM versions we run in each CI job. The JVM has great backward compatibility, so we could run PR CI on one JVM (maybe Java 17) and trigger a nightly run for Java 21.
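
A minimal sketch of that split, just to make the idea concrete. The helper name `jvm_matrix` and the constants are hypothetical, Java 17 as the primary JVM is only my suggestion, and the `full-ci` label is the override proposed earlier in this thread. The output could feed a workflow matrix, but the wiring is not shown here.

```python
import json

# Assumption: Java 17 is the primary PR JVM and Java 21 is only covered by
# non-PR runs (nightly, main). Nothing here has been agreed on this thread.
PRIMARY_JVM = 17
ALL_JVMS = [17, 21]


def jvm_matrix(event_name: str, labels: frozenset = frozenset()) -> list:
    """Return the JVM versions a CI run should cover (hypothetical helper).

    event_name is a GitHub Actions event such as "pull_request",
    "schedule" (nightly) or "push" (main and release branches).
    """
    # PRs get one JVM unless someone asks for the full matrix via the label.
    if event_name == "pull_request" and "full-ci" not in labels:
        return [PRIMARY_JVM]
    # Everything else (nightly, main, labeled PRs) keeps full JVM coverage.
    return list(ALL_JVMS)


if __name__ == "__main__":
    for event in ("pull_request", "schedule", "push"):
        print(event, json.dumps({"jvm": jvm_matrix(event)}))
```

With two JVMs in the matrix today, this would roughly halve the JVM dimension for unlabeled PRs while leaving nightly and main runs unchanged.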

On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]> wrote:
>
> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
> maven repo, with spark the other half.
>
> It seems to me that only PRs which go near gradle/libs.versions.toml are
> going to change dependences, so introduce new CVEs.
>
> There's the separate issue "CVEs are eternal" and all existing
> dependencies are collections of undiscovered/unreported cves. That's
> dependabot's homework, generally.
>
>
> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
>
>> Thanks everyone for the great ideas.
>>
>> Here's where we stand today with respect to ASF runner usage (taken from
>> the link [2] above):
>> GitHub Actions Build Time Used
>> - past 7 days total usage: 218,321 minutes
>> - past 5 days total usage: 120,241 minutes
>>
>> *This puts us below the hard ceiling for resource usage* as described by
>> https://infra.apache.org/github-actions-policy.html
>>
>> > The average number of minutes a project uses *per calendar week MUST
>> NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
>> 4,200 hours)*.
>> > The average number of minutes a project uses *in any consecutive
>> five-day period MUST NOT exceed the equivalent of 30 full-time runners
>> (216,000 minutes, or 3,600 hours)*.
>>
>> We should still make improvements wherever possible.
>>
>> I have a few PRs to reduce CI usage further.
>> - CI: Limit CVE scan runs to relevant changes #16513
>> - Build: Simplify CI workflow path filters to avoid per-workflow
>> maintenance #16302
>>
>> There are a couple of heuristics we can use
>> 1. Don't run CI if not needed. For example, `site/` dir changes shouldn't
>> trigger Spark/Flink/Java CI. This might be optimized already, but we should
>> double check just in case.
>> 2. If we must run CI, fail fast. For example, if there is a formatter
>> issue, fail all inflight CI tasks.
>> 3. Within a specific CI workflow, reduce the matrix wherever possible. Do
>> we really need to run all "Java versions" x "Scala versions" x "Spark
>> versions"?
>> 4. Improve individual CI tasks. Spark CI dominates 57% of all resource
>> usage. I have a tracking issue where I benchmarked where all that time is
>> spent. See https://github.com/apache/iceberg/issues/16397
>>
>> Top CI tasks as % of resource use:
>> - Spark CI: 57.68%
>> - Flink CI: 13.60%
>> - Java CI: 7.02%
>> - CVE Scan: 3.13%
>>
>> Best,
>> Kevin Liu
>>
>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> How about implementing the incremental PR builder? (similar to
>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
>>> )
>>>
>>> I think one of the main causes of GitHub runner pressure in Iceberg is
>>> the breadth of our CI matrix. We support multiple languages (java, python,
>>> go, rust, cpp) and integrations, and for Java we test across multiple JVM
>>> versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
>>> runtime bundles, and more. That coverage is valuable, but running most of
>>> it for every PR is expensive and increases both runner usage and CI wall
>>> time.
>>>
>>> I think the biggest win can be achieved by having an incremental PR
>>> build.
>>> We already have useful building blocks for it: Gradle build cache, path
>>> filters, and version-selective build properties like -DsparkVersions and
>>> -DflinkVersions.
>>>
>>> The idea is to keep full coverage on main, release branches, tags, and
>>> global build changes, but make PR CI depend on the files changed:
>>>
>>> - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>>> - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
>>> - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
>>> - API/Core/Data/File format changes run the owning Java checks plus
>>> selected downstream canaries, such as latest Spark and latest Flink,
>>> instead of the full engine matrix.
>>> - Runtime/bundle CVE checks run only for affected runtime artifacts.
>>> - A full-ci label or global Gradle/workflow changes can still force
>>> the full matrix.
>>>
>>>
>>> Another possible optimization is JVM coverage. Today many PR jobs run
>>> across both Java 17 and Java 21. We could consider running one primary JVM
>>> for PRs, and reserve the full JVM matrix for main, release branches,
>>> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
>>> runner usage and PR wall time, while still preserving broad compatibility
>>> coverage before changes become part of the main branch.
>>>
>>> A practical approach could be:
>>>
>>> PRs: incremental module/version selection, mostly one JVM, plus targeted
>>> canaries.
>>> main: full matrix across JVMs, Spark versions, Flink versions, and
>>> runtime checks.
>>> Manual override: full-ci label for risky or cross-cutting PRs.
>>>
>>> This should reduce queue time, lower GitHub runner consumption, and give
>>> contributors faster feedback without giving up full coverage where it
>>> matters most.
>>>
>>> I am working on a POC https://github.com/apache/iceberg/pull/16566
>>> Suggestions are welcome.
>>>
>>> - Ajantha
>>>
>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:
>>>
>>>> Hi Manu,
>>>>
>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
>>>> wrote:
>>>> >
>>>> > Hi Junwang,
>>>> >
>>>> > Not sure about others but I usually only change status to "Ready for
>>>> review" when CI has passed.
>>>>
>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
>>>> PRs.
>>>>
>>>> Reasons to Disable:
>>>>
>>>> - Cost savings: large teams and monorepos can burn through GitHub
>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>>>> resources on code that may not even compile yet.
>>>> - Reduced noise: draft PRs are often used for experimentation or
>>>> work-in-progress changes. Disabling CI avoids cluttering the PR
>>>> timeline with transient failures while the author is still iterating.
>>>> - Better resource utilization: orgs with limited self-hosted runners
>>>> may prefer to prioritize "Ready for Review" PRs so production-relevant
>>>> changes get feedback and merge capacity sooner.
>>>>
>>>> Reasons to Keep:
>>>>
>>>> - Early error detection: developers can use draft PRs as a sandbox to
>>>> validate builds and tests before requesting review.
>>>> - Self-correction: failed checks on a draft PR allow authors to fix
>>>> lint or test issues before involving reviewers.
>>>> - Higher review confidence: by the time a PR is marked "Ready for
>>>> Review", CI has often already passed at least once, leading to a
>>>> smoother review process.
>>>>
>>>> For myself, when I create a draft PR, I'm usually sharing early
>>>> work-in-progress code with other developers and may not have tested it
>>>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
>>>> just my personal preference though.
>>>>
>>>> >
>>>> > Regards,
>>>> > Manu
>>>> >
>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
>>>> wrote:
>>>> >> >
>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]>
>>>> wrote:
>>>> >> > >
>>>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days ago. It
>>>> should reduce the Spark CI cost by ~25%.
>>>> >> > >
>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
>>>> extension) cause high load due to parameter combinations. I asked AI to
>>>> analyze the build log and recommend changes offering the best ROI. Details
>>>> are in this doc.
>>>> >> > >
>>>> >> > > I can look into dropping some combinations without sacrificing
>>>> essential coverage. E.g., we can probably drop the Hadoop catalog usage in
>>>> test, as it wasn't recommended for production use anyway.
>>>> >> >
>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>>>> >> > usage a little bit. Perhaps we should apply the same approach
>>>> across
>>>> >> > all iceberg subprojects?
>>>> >> >
>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>>>> >>
>>>> >> I've created a PR to show that, see [1], since it's a draft, the CI
>>>> >> won't run. If I click the `Ready for review` button, the actions will
>>>> >> be triggered. Let me know what you think about it.
>>>> >>
>>>> >> [1] https://github.com/apache/iceberg/pull/16561
>>>> >>
>>>> >> >
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <
>>>> [email protected]> wrote:
>>>> >> > >>
>>>> >> > >> Apache DataFusion similarly received this notice. For
>>>> visibility to the Iceberg community, we have tracking issues to try to
>>>> discuss solutions:
>>>> >> > >>
>>>> >> > >> https://github.com/apache/datafusion/issues/22455
>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>>>> >> > >>
>>>> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
>>>> resources, and like the Iceberg project it's due to Spark tests (and
>>>> Iceberg's Spark tests). We are doing some analysis on what subsets might be
>>>> appropriate for our workflows, features, and goals, and will share anything
>>>> that we think might translate back to the Iceberg CI workflows.
>>>> >> > >>
>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <
>>>> [email protected]> wrote:
>>>> >> > >>>
>>>> >> > >>> Hello, Iceberg PMC.
>>>> >> > >>>
>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>>>> >> > >>> across the foundation[1]. The ASF Github shared pool of
>>>> >> > >>> Github-hosted runners has been at, or very close to the limit
>>>> of
>>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>>>> >> > >>> case again today.
>>>> >> > >>>
>>>> >> > >>> Your project has been identified as being among the top 5
>>>> consumers of
>>>> >> > >>> build time over the past 7 days and we request that you bring
>>>> your
>>>> >> > >>> usage down by stream-lining long-running builds. Contact Infra
>>>> for
>>>> >> > >>> a consultation if you are unable to streamline your builds
>>>> further.
>>>> >> > >>>
>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA
>>>> usage as you
>>>> >> > >>> work on stream-lining, as well as locate any bottlenecks in
>>>> the workflows.
>>>> >> > >>>
>>>> >> > >>> Infra will allow you two weeks time (till the 8th of June,
>>>> 2026) to
>>>> >> > >>> progress this, but should you still be above the limits by
>>>> then,
>>>> >> > >>> without a viable path forward, we will be limiting your GHA
>>>> usage.
>>>> >> > >>>
>>>> >> > >>> Kind regards,
>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>>>> >> > >>>
>>>> >> > >>>
>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>>>> >> > >>> [2]
>>>> https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Regards
>>>> >> > Junwang Zhao
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Regards
>>>> >> Junwang Zhao
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>> Junwang Zhao
>>>>
>>>
