[PR] Spark, Core: Use in-memory FileIO/Catalog in Spark tests for faster runs [iceberg]

via GitHub Tue, 16 Jun 2026 16:19:51 -0700


RussellSpitzer opened a new pull request, #16839:
URL: https://github.com/apache/iceberg/pull/16839


   ## Why
   
   Spark integration tests ran against parameterized catalogs that all wrote to 
local disk, which dominated wall-clock time. Switching the test catalogs to 
`InMemoryFileIO` (and replacing the `HADOOP` parameterization with 
`InMemoryCatalog`) keeps the same coverage but executes much faster in a single 
JVM.
   
   A single-class wall-clock measurement 
(`TestRequiredDistributionAndOrdering`) showed a roughly **1.93x speedup for 
`testhive`** and **1.50x for `spark_catalog`**.
   
   ## What
   
   ### Core (one production change)
   - Add `InMemoryCatalog.INSTANCE_NAMESPACE` 
(`in-memory-catalog.instance-namespace`). Instances initialized with the same 
value share namespaces, tables, and views through a JVM-wide store. Needed 
because Spark clones sessions and tests build separate validation catalogs that 
must observe each other's writes.
   - `close()` no longer clears state when the instance shares state through `INSTANCE_NAMESPACE`.
   - Add `clearInstanceNamespace(String)` for explicit test cleanup.
   - New unit tests cover sharing, isolation across distinct namespaces, and 
that `close()` does not wipe shared state.
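
   The sharing semantics listed above can be modeled with a small self-contained sketch. This is not the PR's implementation and does not use Iceberg classes: `SharedStateSketch`, `ToyCatalog`, and the table-name set are hypothetical stand-ins, and only the behavior (same value shares state, `close()` leaves shared state alone, explicit clearing) is taken from the description.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the described behavior; NOT Iceberg's InMemoryCatalog.
final class SharedStateSketch {
  // JVM-wide registry: one shared table store per instance-namespace value.
  private static final Map<String, Set<String>> SHARED = new ConcurrentHashMap<>();

  static final class ToyCatalog implements AutoCloseable {
    private final Set<String> tables;
    private final boolean sharing;

    // A null instanceNamespace means private state, mirroring the opt-in property.
    ToyCatalog(String instanceNamespace) {
      this.sharing = instanceNamespace != null;
      if (sharing) {
        this.tables =
            SHARED.computeIfAbsent(instanceNamespace, k -> ConcurrentHashMap.newKeySet());
      } else {
        this.tables = ConcurrentHashMap.newKeySet();
      }
    }

    void createTable(String name) {
      tables.add(name);
    }

    boolean tableExists(String name) {
      return tables.contains(name);
    }

    @Override
    public void close() {
      // Only private state is wiped; shared state outlives any single instance.
      if (!sharing) {
        tables.clear();
      }
    }
  }

  // Explicit cleanup hook, analogous in spirit to clearInstanceNamespace(String).
  static void clearInstanceNamespace(String instanceNamespace) {
    Set<String> removed = SHARED.remove(instanceNamespace);
    if (removed != null) {
      removed.clear();
    }
  }
}
```

   In this model, two `ToyCatalog` instances built with the same value see each other's tables, which is the property a cloned Spark session and a separate validation catalog would rely on.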
   
   ### Spark tests
   - `SparkCatalogConfig`: switch `HIVE`, `REST`, `SPARK_SESSION`, 
`SPARK_SESSION_WITH_VIEWS`, `SPARK_WITH_HIVE_VIEWS`, 
`SPARK_SESSION_WITH_UNIQUE_LOCATION`, `HIVE_WITH_UNIQUE_LOCATION` to use 
`InMemoryFileIO`. Add an `INMEMORY` config backed by `InMemoryCatalog` (with 
`INSTANCE_NAMESPACE`).
   - `TestBaseWithCatalog`:
     - Clear stale `spark.sql.catalog.<name>.*` properties from the shared 
`SparkSession` before each test (the static session leaks config across 
parameterizations).
     - When a test's catalog config overrides `io-impl`, build a fresh 
validation `HiveCatalog` with the same `FileIO` so reads round-trip.
     - Initialize the validation `InMemoryCatalog` with the test catalog's 
configuration so they share state.
     - Pass `io-impl=InMemoryFileIO` to the `RESTServerExtension` backend.
   - `CatalogTestBase`: replace the `HADOOP` parameter with `INMEMORY`.
   - `TestDataFrameWriterV2`: parameterize the expected error message by 
catalog name (no longer hardcoded to `testhadoop`).
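
   The stale-property cleanup in `TestBaseWithCatalog` amounts to removing every key under one catalog's prefix. The sketch below models the session config as a plain `Map` so it runs on its own; `StaleCatalogConfSketch` and `clearCatalogConf` are hypothetical names, and the PR's actual code may differ. Against a real session the same loop would presumably read keys from `spark.conf().getAll()` and call `spark.conf().unset(key)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; models the shared SparkSession config as a mutable Map.
final class StaleCatalogConfSketch {
  // Removes every "spark.sql.catalog.<name>.*" key and returns the removed keys.
  static List<String> clearCatalogConf(Map<String, String> conf, String catalogName) {
    // The trailing dot keeps "testhive" from also matching "testhive2".
    String prefix = "spark.sql.catalog." + catalogName + ".";
    List<String> removed = new ArrayList<>();
    // Iterate over a copy of the keys so removal does not disturb iteration.
    for (String key : new ArrayList<>(conf.keySet())) {
      if (key.startsWith(prefix)) {
        conf.remove(key);
        removed.add(key);
      }
    }
    return removed;
  }
}
```

   Running this before each test would leave the catalog's own `spark.sql.catalog.<name>` entry to be set again by the next parameterization, while dropping options such as `io-impl` left over from the previous one.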
   
   ### Tests opted out of in-memory I/O
   - `TestCompressionSettings` and `TestMetadataTableReadableMetrics` filter 
`io-impl` out of their parameter properties because they assert against raw 
on-disk Parquet/ORC/Avro files via Hadoop readers (`ParquetFileReader`, 
`OrcFile`, `Files.localOutput`). These continue to use `HadoopFileIO`.
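
   Opting a test out can be pictured as copying the parameter properties without the `io-impl` key, so the catalog falls back to its default `FileIO`. A minimal sketch, assuming the properties arrive as a `Map<String, String>`; `IoImplOptOutSketch` and `withoutIoImpl` are hypothetical names, not the helpers used in the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the opt-out; not code from the PR.
final class IoImplOptOutSketch {
  // Catalog property key that selects the FileIO implementation.
  static final String FILE_IO_IMPL = "io-impl";

  // Returns a copy of props without "io-impl"; the input map is left unchanged.
  static Map<String, String> withoutIoImpl(Map<String, String> props) {
    Map<String, String> copy = new LinkedHashMap<>(props);
    copy.remove(FILE_IO_IMPL);
    return copy;
  }
}
```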
   
   ## Risks / open questions
   
   - This is a **test-only** behavior change in Spark, plus a **public 
addition** to `InMemoryCatalog` (`INSTANCE_NAMESPACE` constant + 
`clearInstanceNamespace`). The new property is opt-in.
   - The change touches two modules. Reviewers may prefer splitting into:
     1. `Core: InMemoryCatalog INSTANCE_NAMESPACE for shared in-process state`
     2. `Spark: Use in-memory FileIO/Catalog in Spark 4.1 tests for faster runs`
     Happy to split if requested.
   - Several other tests in the suite hit pre-existing flakes 
(`TestSparkReadMetrics.testDeleteMetrics` rounds planning duration to 0 under 
in-memory I/O; `TestTimestampWithoutZone` has Hive Metastore namespace ordering 
issues). These pass in isolation and are not addressed here.
   
   ## Validation
   
   - `./gradlew :iceberg-core:spotlessCheck 
:iceberg-spark:iceberg-spark-4.1_2.13:spotlessCheck` clean
   - `./gradlew :iceberg-core:test --tests 
org.apache.iceberg.inmemory.TestInMemoryCatalog` passes (the three new tests 
included)
   - `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava` clean
   - `spark.source` package-level run (60 test classes) — pre-existing flakes 
only
   
   ## Draft
   
   Draft PR — running `parallel-reviews` against this for the multi-perspective 
review pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

