stevenzwu opened a new pull request, #16549:
URL: https://github.com/apache/iceberg/pull/16549

   ## Summary
   
   Reduces `SparkRowLevelOperationsTestBase` parameter rows from 6 to 3 in v4.0 
/ v4.1, and from 7 to 3 in v3.5, shifting from "test every catalog backend" to 
"test the catalogs that matter for production":
   
   - **testhive** (Hive) — kept as the Hive-metastore baseline.
   - **testrest** (REST) — added in place of `testhadoop`. `HadoopCatalog` 
isn't recommended for production; REST is the OSS-strategic catalog.
   - **spark_catalog (REST-backed)** — re-pointed from Hive to REST so the 
`SparkSessionCatalog` row exercises the REST commit path.
   
   The trim affects 9 concrete subclasses that inherit this base: 
`TestCopyOnWriteMerge` / `TestMergeOnReadMerge` / `TestCopyOnWriteUpdate` / 
`TestMergeOnReadUpdate` / `TestCopyOnWriteDelete` / `TestMergeOnReadDelete`, 
both `*MergeMetrics` subclasses, and `TestMergeSchemaEvolution`. Each cuts test 
invocations by **50%** (≈57% on Spark 3.5). `TestCopyOnWriteWithLineage` / 
`TestMergeOnReadWithLineage` are unaffected — 
`TestRowLevelOperationsWithLineage` redeclares `parameters()` locally.
   
   Expected savings: ~6-7 minutes off each `extensions` matrix-cell wallclock 
on CI (out of ~33 min total).
   
   ## Axis coverage — before
   
   v4.0 / v4.1 (6 rows):
   
   | # | catalog | format | vec | dist | branch | planning | fanout | fmtV |
   |---|---|---|---|---|---|---|---|---|
   | 1 | testhive | ORC | true | NONE | MAIN | LOCAL | true | 2 |
   | 2 | testhive | PARQUET | true | NONE | test | DISTRIBUTED | false | 2 |
   | 3 | testhadoop | PARQUET | random | HASH | null | LOCAL | true | 2 |
   | 4 | spark_catalog (Hive) | AVRO | false | RANGE | test | DISTRIBUTED | 
false | 2 |
   | 5 | testhadoop | PARQUET | random | HASH | null | LOCAL | true | 3 |
   | 6 | spark_catalog (Hive) | AVRO | false | RANGE | test | DISTRIBUTED | 
false | 3 |
   
   v3.5 (7 rows): same as v4.x rows 1-4 plus an extra `testhadoop / PARQUET / 
vec=true / HASH / v3` row (Spark 3.5's row 5) and `vec=false` variant (row 6), 
with the v3 spark_catalog row at position 7.
   
   | Axis | Values present (rows) |
   |---|---|
   | Catalog | testhive (1, 2) · testhadoop (3, 5) · spark_catalog/Hive (4, 6) |
   | File format | ORC (1) · PARQUET (2, 3, 5) · AVRO (4, 6) |
   | Vectorized | true (1, 2) · random (3, 5) · false (4, 6) |
   | Distribution | NONE (1, 2) · HASH (3, 5) · RANGE (4, 6) |
   | Branch | MAIN (1) · test (2, 4, 6) · null (3, 5) |
   | Planning | LOCAL (1, 3, 5) · DISTRIBUTED (2, 4, 6) |
   | Fanout | true (1, 3, 5) · false (2, 4, 6) |
   | formatVersion | v2 (1, 2, 3, 4) · v3 (5, 6) |
   
   Two pairs are duplicates on every axis except `formatVersion`: rows 3+5 and 
rows 4+6.
   
   ## Axis coverage — after
   
   | # | catalog | format | vec | dist | branch | planning | fanout | fmtV |
   |---|---|---|---|---|---|---|---|---|
   | 1 | testhive | ORC | true | NONE | MAIN | LOCAL | true | 2 |
   | 2 | testrest (REST) | PARQUET | true | HASH | null | DISTRIBUTED | false | 
2 |
   | 3 | spark_catalog (REST) | AVRO | false | RANGE | test | DISTRIBUTED | 
false | 3 |
   
   | Axis | Values present (rows) |
   |---|---|
   | Catalog | testhive (1) · testrest (2) · spark_catalog/REST (3) |
   | File format | ORC (1) · PARQUET (2) · AVRO (3) |
   | Vectorized | true (1, 2) · false (3) |
   | Distribution | NONE (1) · HASH (2) · RANGE (3) |
   | Branch | MAIN (1) · null (2) · test (3) |
   | Planning | LOCAL (1) · DISTRIBUTED (2, 3) |
   | Fanout | true (1) · false (2, 3) |
   | formatVersion | v2 (1, 2) · v3 (3) |
   
   Every value of every axis is still covered. Three axes (catalog, file 
format, distribution, branch) have exactly one row per value — joint-axis 
coverage is intentionally thinner than before.
   
   ## Design rationale
   
   - **`testhadoop` dropped**: HadoopCatalog isn't a production target; the 
row-level operation paths it exercises (commit through filesystem rename) 
aren't catalog-distinctive enough to justify the CI cost.
   - **REST added (non-session)**: REST is the OSS-strategic catalog. Worth 
direct row-level coverage, not just the SessionCatalog wrapper.
   - **`spark_catalog` repointed to REST**: The SessionCatalog wrapper test is 
more valuable when it exercises the REST commit path than when it duplicates 
Hive coverage already provided by row 1.
   - **`formatVersion` collapsed to one v3 row**: v2 is covered twice 
(testhive, testrest); v3 covered once. v3 introduces DV semantics that 
`validateSnapshot` checks via `formatVersion >= 3`, so at least one v3 row is 
required.
   
   ## Known issues to follow up
   
   A local smoke run of `TestCopyOnWriteDelete` exposed 3 failures on the 
REST-backed rows that are pre-existing test-fixture assumptions, not row-level 
operation bugs:
   
   1. **`testDeleteWithoutScanningTable`** (testrest + spark_catalog/REST, 2 
failures). Throws `IllegalArgumentException: Missing scheme` from 
`TestBase.move()`, which calls `Path.of(URI.create(location))`. Hive's manifest 
paths come back with `file://` schemes; the REST catalog returns local paths 
without a scheme.
   2. **`testDeleteFileThenMetadataDelete`** (testrest, 1 failure). Asserts a 
data file count after metadata-delete; the REST commit path produces a 
different file-set shape than Hive does.
   
   Both are infrastructure gaps surfaced by switching this test from testhive 
to testrest. They need targeted fixes (in `TestBase.move()` and the 
metadata-delete assertion) — out of scope for this PR.
   
   ## Test plan
   
   - [x] `spotlessCheck` + `compileTestJava` pass on all 3 Spark versions.
   - [x] Local smoke run of `TestCopyOnWriteDelete` against Spark 4.1 (132 
invocations vs 264 originally; 50% reduction confirmed; pre-existing REST 
fixture failures noted above).
   - [ ] Full Spark CI run on this branch.
   - [ ] Verify no test method ends up with zero matching rows after the trim — 
quick `git grep "assumeThat"` showed no stacked predicates that would silence a 
test, but warrants a final pass.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to