wombatu-kun opened a new pull request, #16565: URL: https://github.com/apache/iceberg/pull/16565
Part of #16397.

`TestRewriteDataFilesAction` is the #2 slowest Spark test class in `spark-ci` (~12.4 min in the profiling gist linked from #16397). It is parameterized only on `formatVersion = [2, 3, 4]`, and each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests), so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared `SCALE = 400000` consumed by ~50 `@TestTemplate` methods that each write and then rewrite ~400k rows, across three format versions.

## What changed

Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small `SCALE = 400`.

The few methods whose assertions genuinely depend on large files keep the original volume via a new `LARGE_SCALE = 400000` constant, so they stay byte-for-byte equivalent:

- `testBinPackSplitLargeFile`
- `testBinPackCombineMixedFiles`
- `testBinPackCombineMediumFiles`
- `testAutoSortShuffleOutput`
- `testZOrderSort`

This mirrors the sibling `TestRewritePositionDeleteFilesAction`, which already uses `SCALE = 4000`. The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.

## Measured impact

Measured locally as the JUnit testsuite time summed across the three `formatVersion` suites, three runs each, via `cleanTest test --no-build-cache` (forces real re-execution, no cache):

| | mean of 3 runs | tests |
| --- | --- | --- |
| Before | 688 s (11.5 min) | 171 pass / 0 fail |
| After | 316 s (5.3 min) | 171 pass / 0 fail |

That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved; only the data volume shrank.
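For illustration only (this is not code from the PR): a minimal, self-contained sketch of why some assertions tolerate a small `SCALE` while others need `LARGE_SCALE`. The class `ScaleSketch` and the helpers `writeFiles` and `rewrittenFileCount` are hypothetical, and the row-count-based bin-pack model is an assumption that only loosely imitates a size-based rewrite. Only the two constant values come from the description above.

```java
// Hypothetical model, not the Iceberg test code: files are represented
// only by their row counts, and a rewrite packs rows up to a fixed target.
class ScaleSketch {
  // Constant values as stated in the PR description.
  static final int SCALE = 400;
  static final int LARGE_SCALE = 400_000;

  // Distributes `rows` rows round-robin over `files` files and
  // returns the row count of each file.
  static int[] writeFiles(int rows, int files) {
    int[] perFile = new int[files];
    for (int i = 0; i < rows; i++) {
      perFile[i % files]++;
    }
    return perFile;
  }

  // Number of output files when all input rows are packed into files
  // holding at most `targetRowsPerFile` rows each (ceiling division).
  static int rewrittenFileCount(int[] inputFiles, int targetRowsPerFile) {
    long total = 0;
    for (int rowsInFile : inputFiles) {
      total += rowsInFile;
    }
    return (int) ((total + targetRowsPerFile - 1) / targetRowsPerFile);
  }

  public static void main(String[] args) {
    int target = 100_000; // assumed absolute target, in rows per file

    // Both scales start from the same structure: 4 input files.
    System.out.println(writeFiles(SCALE, 4).length);        // prints 4
    System.out.println(writeFiles(LARGE_SCALE, 4).length);  // prints 4

    // With an absolute target, only the large volume forces a split.
    System.out.println(rewrittenFileCount(writeFiles(SCALE, 4), target));       // prints 1
    System.out.println(rewrittenFileCount(writeFiles(LARGE_SCALE, 4), target)); // prints 4
  }
}
```

In this model, an assertion such as "4 input files exist before the rewrite" holds at either scale, whereas an assertion that the rewrite produces several output files only holds at the large volume, which is the kind of dependency that would keep a method on `LARGE_SCALE`.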
