stevenzwu opened a new pull request, #16559:
URL: https://github.com/apache/iceberg/pull/16559

   ## Summary
   
   Reduces `TestStructuredStreamingRead3` parameter rows from 8 to 2 in v3.5 / 
v4.0 / v4.1, dropping the catalogs that don't add meaningful coverage for 
streaming-read semantics. The two rows kept are:
   
   - **testhive** (Hive, async=true) — Hive-metastore baseline.
   - **testrest** (REST, async=false) — REST is the OSS-strategic catalog.
   
   Removed: `testhadoop` (HadoopCatalog isn't recommended for production) and 
`spark_catalog` (SessionCatalog wrapper differences live in 
DDL/table-resolution paths, not streaming reads — both `SparkCatalog` and 
`SparkSessionCatalog` resolve to `SparkTable` once a table is identified as 
Iceberg, and streaming reads don't exercise the wrapper-specific code paths).
   
   The trim cuts the per-class invocation count from **264 → 66 (75% 
reduction)**.
   
   `TestStructuredStreamingRead3` was the **highest-CPU test class in the Spark 
core CI job** (20.3% of total test CPU on `(17, 4.1, 2.13, core)`, ~931 CPU-sec 
out of 4595). Per-row coverage analysis shows no test method uses 
`assumeThat(catalogName)`, so no test gets silenced by the trim — every test 
still runs across both rows.
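
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the inputs are the numbers quoted in this description, and the per-row count is derived under the assumption that every test method runs exactly once per parameter row.

```python
# Figures quoted in this description.
rows_before, rows_after = 8, 2
invocations_before, invocations_after = 264, 66
class_cpu_sec, total_cpu_sec = 931, 4595

# Derived value (assumption: each test method runs once per parameter row).
per_row_before = invocations_before // rows_before
per_row_after = invocations_after // rows_after

reduction_pct = 100 * (invocations_before - invocations_after) / invocations_before
cpu_share_pct = 100 * class_cpu_sec / total_cpu_sec

print(per_row_before, per_row_after)               # 33 33
print(f"{reduction_pct:.0f}% fewer invocations")   # 75% fewer invocations
print(f"{cpu_share_pct:.1f}% of test CPU")         # 20.3% of test CPU
```

The per-row count is the same before and after, which is consistent with no test being skipped on either remaining row.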
   
   ## Axis coverage — before
   
   | # | catalog | async |
   |---|---|---|
   | 1 | testhive | false |
   | 2 | testhive | true |
   | 3 | testhadoop | false |
   | 4 | testhadoop | true |
   | 5 | testrest | false |
   | 6 | testrest | true |
   | 7 | spark_catalog (Hive) | false |
   | 8 | spark_catalog (Hive) | true |
   
   | Axis | Values present (rows) |
   |---|---|
   | Catalog | testhive (1, 2) · testhadoop (3, 4) · testrest (5, 6) · spark_catalog (7, 8) |
   | async | false (1, 3, 5, 7) · true (2, 4, 6, 8) |
   
   ## Axis coverage — after
   
   | # | catalog | async |
   |---|---|---|
   | 1 | testhive | true |
   | 2 | testrest | false |
   
   | Axis | Values present (rows) |
   |---|---|
   | Catalog | testhive (1) · testrest (2) |
   | async | true (1) · false (2) |
   
   Both async values are still tested, and so are both strategic production 
catalogs. Joint coverage of `(testhive, false)` and `(testrest, true)` is 
sacrificed, but no test in the class depends on those specific combinations 
(no `assumeThat` on `catalogName` or stacked predicates).
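
The coverage claim can be illustrated with a small model of the two matrices. This is a sketch, not the test class's actual parameter code: the rows are transcribed from the tables above, and `axis_values` and `missing_pairs` are hypothetical helper names.

```python
from itertools import product

# Parameter rows transcribed from the before/after tables: (catalog, async).
before = [
    (catalog, is_async)
    for catalog in ("testhive", "testhadoop", "testrest", "spark_catalog")
    for is_async in (False, True)
]
after = [("testhive", True), ("testrest", False)]


def axis_values(rows):
    """Return the set of values present on each axis."""
    catalogs = {catalog for catalog, _ in rows}
    asyncs = {is_async for _, is_async in rows}
    return catalogs, asyncs


def missing_pairs(rows):
    """Return (catalog, async) combinations of the present values that no row covers."""
    catalogs, asyncs = axis_values(rows)
    return sorted(set(product(catalogs, asyncs)) - set(rows))


print(axis_values(after)[1] == {True, False})  # True: both async values remain
print(missing_pairs(before))                   # []
print(missing_pairs(after))                    # [('testhive', False), ('testrest', True)]
```

The two pairs reported for `after` are exactly the joint combinations named as sacrificed.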
   
   ## Design rationale
   
   - **Streaming read semantics aren't catalog-specific.** A streaming read on 
an Iceberg table goes through Spark's micro-batch source machinery, the Iceberg 
`SparkTable` DSv2 read path, and the `IncrementalScan` planning logic. The 
catalog backend is only involved at table-resolution time. Once the table is 
loaded, the streaming behavior is the same regardless of catalog.
   - **`testhadoop` dropped.** `HadoopCatalog` isn't a production target for 
Iceberg. The streaming tests don't exercise anything HadoopCatalog-specific.
   - **`spark_catalog` dropped.** The `SparkSessionCatalog` wrapper is worth 
testing where it actually intercepts calls (DDL routing, V1-vs-V2 fallback for 
non-Iceberg tables). Streaming reads exercise none of those paths.
   - **REST kept at `async=false`, Hive at `async=true`.** Each catalog gets 
one async value, so both values stay covered and neither catalog is implicitly 
favored by running both modes.
   
   ## Test plan
   
   - [x] `spotlessCheck` passes on all 3 Spark versions.
   - [x] Local smoke run on Spark 4.1: `TestStructuredStreamingRead3` — **66 
tests, 0 skipped, 0 failures, 0 errors** (down from 264 invocations originally; 
75% reduction confirmed).
   - [x] Verified no test method uses `assumeThat(catalogName)` — `git grep 
"assumeThat"` in the file returned 0 matches, so the catalog axis trim cannot 
silence any test.
   - [ ] Full Spark CI run on this branch.
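
The `assumeThat` check in the plan above was done with `git grep`. For illustration only, the same substring check can be sketched in Python. `find_assume_lines` is a hypothetical helper and the two Java snippets are made-up samples, not code from the test class.

```python
def find_assume_lines(java_source):
    """Return 1-based line numbers containing the substring 'assumeThat'."""
    return [
        number
        for number, line in enumerate(java_source.splitlines(), start=1)
        if "assumeThat" in line
    ]


# Made-up samples: one test guarded by a catalog assumption, one unguarded.
guarded = """@TestTemplate
public void testSomething() {
  assumeThat(catalogName).isEqualTo("testhive");
}
"""
unguarded = """@TestTemplate
public void testSomething() {
  runStreamingRead();
}
"""

print(find_assume_lines(guarded))    # [3]
print(find_assume_lines(unguarded))  # []
```

An empty result for the real file is what makes trimming the catalog axis safe: no test can be skipped based on which catalog a row uses.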
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   



