[PR] Spark tests cache rewrite input [iceberg]

via GitHub Tue, 09 Jun 2026 01:30:10 -0700


Baunsgaard opened a new pull request, #16740:
URL: https://github.com/apache/iceberg/pull/16740


   ## What
   `TestRewriteDataFilesAction` is the slowest single class in the Spark core test
   module. Each test materializes a large (`SCALE = 400000`-row) input table via a
   Spark write before exercising the rewrite under test, and many tests reuse the
   same input shape across the `formatVersion` matrix.
   This PR caches the written input data files keyed by table shape (`formatVersion`,
   spec, `files`, `rows`, `partitions`, properties) and reuses them by re-appending
   the cached `DataFile`s to a fresh table. The expensive Spark write of the input
   now runs once per table shape per JVM fork instead of once per test; the rewrite
   under test still runs per test on its own fresh table.
   Applied identically to Spark 3.5, 4.0 and 4.1.
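
The caching idea described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in the PR: `InputCacheSketch`, `TableShape`, `expensiveWrite` and `inputFiles` are hypothetical names, and file paths stand in for Iceberg `DataFile` objects so the example runs without Spark or Iceberg on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class InputCacheSketch {
  // Hypothetical cache key; the fields mirror the table shape listed in the description.
  record TableShape(
      int formatVersion, String spec, int files, int rows, int partitions,
      Map<String, String> properties) {}

  // Static, so entries survive across test methods within one JVM fork.
  private static final Map<TableShape, List<String>> CACHE = new ConcurrentHashMap<>();

  // Counts how often the stand-in for the expensive Spark write actually runs.
  static final AtomicInteger EXPENSIVE_WRITES = new AtomicInteger();

  // Stand-in for the Spark write: returns one fake file path per requested file.
  static List<String> expensiveWrite(TableShape shape) {
    EXPENSIVE_WRITES.incrementAndGet();
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < shape.files(); i++) {
      paths.add("data-" + i + ".parquet");
    }
    return List.copyOf(paths);
  }

  // Returns the cached files for a shape, writing them only on the first request.
  static List<String> inputFiles(TableShape shape) {
    return CACHE.computeIfAbsent(shape, InputCacheSketch::expensiveWrite);
  }

  public static void main(String[] args) {
    TableShape first = new TableShape(2, "unpartitioned", 4, 400_000, 0, Map.of());
    TableShape second = new TableShape(2, "unpartitioned", 4, 400_000, 0, Map.of());
    // Records compare by value, so both requests hit the same cache entry.
    System.out.println(inputFiles(first).equals(inputFiles(second))); // prints "true"
    System.out.println(EXPENSIVE_WRITES.get()); // prints "1"
  }
}
```

A record is used as the key because its generated `equals` and `hashCode` compare all fields, so two tests that ask for the same shape share one entry while any differing field (for example another `formatVersion`) produces a separate write.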
   ## Why it is safe
   - The generated data is deterministic (fixed `Random(42)` seed), so reuse is
     byte-identical to regenerating it.
   - Cached files live in a static `@TempDir`, so they survive across tests (not
     wiped by the per-test temp dir) and are cleaned up after the class.
   - `includeColumnStats()` is used when collecting the cached files so lower/upper
     bounds and value counts are preserved on re-append.
   - The rewrite under test is unchanged and still runs per test, so no assertion
     is weakened.
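
The determinism argument in the first bullet rests on a documented property of `java.util.Random`: two instances created with the same seed produce the same sequence of values. The sketch below illustrates only that property; `SeededDataSketch` and `generate` are hypothetical names and do not reproduce the test's actual record generator.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededDataSketch {
  // Generates `rows` pseudo-random ints from a fixed seed.
  static int[] generate(long seed, int rows) {
    Random random = new Random(seed);
    int[] values = new int[rows];
    for (int i = 0; i < rows; i++) {
      values[i] = random.nextInt();
    }
    return values;
  }

  public static void main(String[] args) {
    // Same seed, same row count: the two runs yield identical values.
    int[] firstRun = generate(42L, 1_000);
    int[] secondRun = generate(42L, 1_000);
    System.out.println(Arrays.equals(firstRun, secondRun)); // prints "true"
  }
}
```

Because regenerating the input would yield the same values, reusing files written once from that input does not change what the rewrite under test sees.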
   ## Results (local, JDK 17, 32 cores)
   | Scope | baseline | with cache |
   |---|---:|---:|
   | `TestRewriteDataFilesAction` (single-thread) | 705s | 455s |
   | full `iceberg-spark-3.5_2.13` core module (`testParallelism=auto`) | 18m56s | 14m57s |

   Test/skip counts are unchanged at both class and module level:
   Spark 3.5 = 168 tests / 6 skipped / 0 failed; Spark 4.0 & 4.1 = 171 / 6 / 0.
   ## Notes
   - Scoped to the `createTable(int)` / `createTablePartitioned(...)` helpers; the
     few in-test `writeRecords(..., SCALE, ...)` call sites are not yet cached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

