wombatu-kun opened a new pull request, #16565:
URL: https://github.com/apache/iceberg/pull/16565

   Part of #16397.
   
   `TestRewriteDataFilesAction` is the #2 slowest Spark test class in 
`spark-ci` (~12.4 min in the profiling gist linked from #16397). It is 
parameterized only on `formatVersion = [2, 3, 4]` — each version is meaningful 
(v2 position deletes, v3 deletion vectors, v4 Parquet manifests) — so its 
matrix cannot be trimmed. Its runtime is instead dominated by data volume: a 
shared `SCALE = 400000` consumed by ~50 `@TestTemplate` methods that each write 
and then rewrite ~400k rows, across three format versions.
   
   ## What changed
   
   Most methods only assert on file/snapshot counts and rewrite structure, 
which do not depend on the absolute row count, so they now use a small `SCALE = 
400`. The few methods whose assertions genuinely depend on large files keep the 
original volume via a new `LARGE_SCALE = 400000` constant, so they stay 
byte-for-byte equivalent: `testBinPackSplitLargeFile`, 
`testBinPackCombineMixedFiles`, `testBinPackCombineMediumFiles`, 
`testAutoSortShuffleOutput`, and `testZOrderSort`. This mirrors the sibling 
`TestRewritePositionDeleteFilesAction`, which already uses `SCALE = 4000`.
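
As a rough illustration of the split described above, the two constants could sit side by side as in the sketch below. Only the names `SCALE` and `LARGE_SCALE` and their values come from this PR; the class `ScaleSketch` and the helper `generateRows` are hypothetical stand-ins, not the actual test code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the actual TestRewriteDataFilesAction source.
class ScaleSketch {
  // Default row count for tests that assert only on file/snapshot counts (value from the PR).
  static final int SCALE = 400;
  // Original volume, kept for tests whose assertions depend on large files (value from the PR).
  static final int LARGE_SCALE = 400000;

  // Hypothetical stand-in for the test's data writer: produces rowCount synthetic row ids.
  static List<Integer> generateRows(int rowCount) {
    List<Integer> rows = new ArrayList<>(rowCount);
    for (int i = 0; i < rowCount; i++) {
      rows.add(i);
    }
    return rows;
  }

  public static void main(String[] args) {
    // Most tests would write the small volume.
    System.out.println(generateRows(SCALE).size());
    // Size-sensitive tests would opt in to the original volume.
    System.out.println(generateRows(LARGE_SCALE).size());
    // Ratio between the two volumes.
    System.out.println(LARGE_SCALE / SCALE);
  }
}
```

Keeping the large volume behind a separately named constant makes each size-sensitive test opt in explicitly, so the default stays cheap.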
   
   The same change is applied identically to the v3.5, v4.0, and v4.1 Spark 
trees.
   
   ## Measured impact
   
   Measured locally as the JUnit testsuite time summed across the three 
`formatVersion` suites, three runs each, via `cleanTest test --no-build-cache` 
(forces real re-execution, no cache):
   
   | | mean of 3 runs | tests |
   | --- | --- | --- |
   | Before | 688 s (11.5 min) | 171 pass / 0 fail |
   | After | 316 s (5.3 min) | 171 pass / 0 fail |
   
   That is a ~54% reduction (≈60% at warm steady-state). Test counts and 
pass/fail are unchanged across all three trees, so coverage is preserved — only 
the data volume shrank.
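
For reference, the ~54% figure follows directly from the two means in the table. The sketch below only redoes that arithmetic; the class and method names are hypothetical, and the ≈60% warm steady-state figure is not derivable from the table, so it is not reproduced here.

```java
import java.util.Locale;

// Illustrative arithmetic check of the reported reduction; names are hypothetical.
class ReductionSketch {
  // Percentage reduction going from beforeSeconds to afterSeconds.
  static double reductionPercent(double beforeSeconds, double afterSeconds) {
    return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
  }

  public static void main(String[] args) {
    double before = 688.0; // mean of 3 runs before the change, in seconds (from the table)
    double after = 316.0;  // mean of 3 runs after the change, in seconds (from the table)
    // Locale.ROOT keeps the decimal separator stable across environments.
    System.out.println(String.format(Locale.ROOT, "%.1f%%", reductionPercent(before, after)));
  }
}
```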
   


