Baunsgaard opened a new pull request, #16696:
URL: https://github.com/apache/iceberg/pull/16696

   ## What
   
   `RandomData` (Spark test helper) sized every generated list and map with
   `random.nextInt(20)`. Because the bound is applied at *every* nesting level,
   the expected element count compounds multiplicatively in deeply nested
   schemas. The worst case is
   `AvroDataTestBase.testMixedTypes`, which embeds the full ~19-field primitive
   struct two-to-three levels deep across five fields — each run generates well
   over a million leaf values, so the test cost is dominated by random-data
   volume rather than the read/write code paths being exercised.
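
A rough way to see the compounding, as a sketch rather than a model of the real test: assume every nesting level draws its size independently from `random.nextInt(bound)`, which is uniform on `[0, bound)` with mean `(bound - 1) / 2`. The class and method names below are hypothetical and do not exist in Iceberg.

```java
public class NestedSizeEstimate {
    // nextInt(bound) returns at most bound - 1, so the worst-case number of
    // leaves under `depth` nested collections is (bound - 1)^depth.
    public static long maxLeaves(int bound, int depth) {
        long leaves = 1;
        for (int level = 0; level < depth; level++) {
            leaves *= (bound - 1);
        }
        return leaves;
    }

    // The mean of nextInt(bound) is (bound - 1) / 2. Assuming independent
    // sizes at each level, the expected leaf count is the product of the means.
    public static double expectedLeaves(int bound, int depth) {
        return Math.pow((bound - 1) / 2.0, depth);
    }

    public static void main(String[] args) {
        for (int depth = 1; depth <= 3; depth++) {
            System.out.printf(
                "depth=%d  bound=20: max=%d, expected=%.3f  bound=10: max=%d, expected=%.3f%n",
                depth,
                maxLeaves(20, depth), expectedLeaves(20, depth),
                maxLeaves(10, depth), expectedLeaves(10, depth));
        }
    }
}
```

Under these assumptions, three nested levels average about 857 leaves per top-level collection with a bound of 20 and about 91 with a bound of 10. This ignores the per-leaf cost of the embedded struct, so it only illustrates the scaling.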
   
   This replaces the hard-coded `20` with a named constant
   `MAX_COLLECTION_SIZE = 10` in the Spark 3.5 / 4.0 / 4.1 test copies.
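
The shape of the change can be sketched as follows. Only the constant name and value come from this PR. The class `BoundedCollectionSketch` and the method `randomList` are hypothetical stand-ins, not the actual `RandomData` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BoundedCollectionSketch {
    // Named constant replacing the hard-coded 20 (name and value from the PR text).
    public static final int MAX_COLLECTION_SIZE = 10;

    // Hypothetical generator: the size is drawn from [0, MAX_COLLECTION_SIZE),
    // so a list holds between zero and nine elements.
    public static List<Integer> randomList(Random random) {
        int size = random.nextInt(MAX_COLLECTION_SIZE);
        List<Integer> values = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            values.add(random.nextInt());
        }
        return values;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        int largest = 0;
        for (int i = 0; i < 1000; i++) {
            largest = Math.max(largest, randomList(random).size());
        }
        System.out.println("largest size seen: " + largest);
    }
}
```

Because the upper bound of `nextInt` is exclusive, a bound of 10 never produces a collection with more than nine elements.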
   
   ## Why
   
   `testMixedTypes` is the single most expensive test method in the
   `iceberg-spark` core suite, appearing at the top of every format read/write
   test class. The collection size has no bearing on coverage — the schemas,
   types, and nesting structures under test are identical regardless of how many
   elements each collection holds — so this is pure scaffolding overhead.
   
   ## Impact
   
   Measured locally (JDK 17, Spark 3.5 core), `testMixedTypes` per class,
   single-threaded:
   
   | Class | before | after |
   |---|---:|---:|
   | TestAvroDataFrameWrite | 24.1s | 7.8s |
   | TestParquetDataFrameWrite | 20.0s | 3.6s |
   | TestORCDataFrameWrite | 19.9s | 3.4s |
   | TestParquetScan | 17.7s | 2.6s |
   | TestParquetVectorizedScan | 17.5s | 2.3s |
   | TestAvroScan | 17.5s | 2.4s |
   
   Collections still hold up to nine elements, preserving data variety.
   
   ## Testing
   
   `./gradlew :iceberg-spark:iceberg-spark-3.5_2.13:test` — **5,084 tests, 0
   failures** (identical pass/skip counts to before the change).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

