stevenzwu opened a new pull request, #16549: URL: https://github.com/apache/iceberg/pull/16549
## Summary Reduces `SparkRowLevelOperationsTestBase` parameter rows from 6 to 3 in v4.0 / v4.1, and from 7 to 3 in v3.5, shifting from "test every catalog backend" to "test the catalogs that matter for production": - **testhive** (Hive) — kept as the Hive-metastore baseline. - **testrest** (REST) — added in place of `testhadoop`. `HadoopCatalog` isn't recommended for production; REST is the OSS-strategic catalog. - **spark_catalog (REST-backed)** — re-pointed from Hive to REST so the `SparkSessionCatalog` row exercises the REST commit path. The trim affects 9 concrete subclasses that inherit this base: `TestCopyOnWriteMerge` / `TestMergeOnReadMerge` / `TestCopyOnWriteUpdate` / `TestMergeOnReadUpdate` / `TestCopyOnWriteDelete` / `TestMergeOnReadDelete`, both `*MergeMetrics` subclasses, and `TestMergeSchemaEvolution`. Each cuts test invocations by **50%** (≈57% on Spark 3.5). `TestCopyOnWriteWithLineage` / `TestMergeOnReadWithLineage` are unaffected — `TestRowLevelOperationsWithLineage` redeclares `parameters()` locally. Expected savings: ~6-7 minutes off each `extensions` matrix-cell wallclock on CI (out of ~33 min total). ## Axis coverage — before v4.0 / v4.1 (6 rows): | # | catalog | format | vec | dist | branch | planning | fanout | fmtV | |---|---|---|---|---|---|---|---|---| | 1 | testhive | ORC | true | NONE | MAIN | LOCAL | true | 2 | | 2 | testhive | PARQUET | true | NONE | test | DISTRIBUTED | false | 2 | | 3 | testhadoop | PARQUET | random | HASH | null | LOCAL | true | 2 | | 4 | spark_catalog (Hive) | AVRO | false | RANGE | test | DISTRIBUTED | false | 2 | | 5 | testhadoop | PARQUET | random | HASH | null | LOCAL | true | 3 | | 6 | spark_catalog (Hive) | AVRO | false | RANGE | test | DISTRIBUTED | false | 3 | v3.5 (7 rows): same as v4.x rows 1-4 plus an extra `testhadoop / PARQUET / vec=true / HASH / v3` row (Spark 3.5's row 5) and `vec=false` variant (row 6), with the v3 spark_catalog row at position 7. | Axis | Values present (rows) | |---|---| | Catalog | testhive (1, 2) · testhadoop (3, 5) · spark_catalog/Hive (4, 6) | | File format | ORC (1) · PARQUET (2, 3, 5) · AVRO (4, 6) | | Vectorized | true (1, 2) · random (3, 5) · false (4, 6) | | Distribution | NONE (1, 2) · HASH (3, 5) · RANGE (4, 6) | | Branch | MAIN (1) · test (2, 4, 6) · null (3, 5) | | Planning | LOCAL (1, 3, 5) · DISTRIBUTED (2, 4, 6) | | Fanout | true (1, 3, 5) · false (2, 4, 6) | | formatVersion | v2 (1, 2, 3, 4) · v3 (5, 6) | Two pairs are duplicates on every axis except `formatVersion`: rows 3+5 and rows 4+6. ## Axis coverage — after | # | catalog | format | vec | dist | branch | planning | fanout | fmtV | |---|---|---|---|---|---|---|---|---| | 1 | testhive | ORC | true | NONE | MAIN | LOCAL | true | 2 | | 2 | testrest (REST) | PARQUET | true | HASH | null | DISTRIBUTED | false | 2 | | 3 | spark_catalog (REST) | AVRO | false | RANGE | test | DISTRIBUTED | false | 3 | | Axis | Values present (rows) | |---|---| | Catalog | testhive (1) · testrest (2) · spark_catalog/REST (3) | | File format | ORC (1) · PARQUET (2) · AVRO (3) | | Vectorized | true (1, 2) · false (3) | | Distribution | NONE (1) · HASH (2) · RANGE (3) | | Branch | MAIN (1) · null (2) · test (3) | | Planning | LOCAL (1) · DISTRIBUTED (2, 3) | | Fanout | true (1) · false (2, 3) | | formatVersion | v2 (1, 2) · v3 (3) | Every value of every axis is still covered. Three axes (catalog, file format, distribution, branch) have exactly one row per value — joint-axis coverage is intentionally thinner than before. ## Design rationale - **`testhadoop` dropped**: HadoopCatalog isn't a production target; the row-level operation paths it exercises (commit through filesystem rename) aren't catalog-distinctive enough to justify the CI cost. - **REST added (non-session)**: REST is the OSS-strategic catalog. Worth direct row-level coverage, not just the SessionCatalog wrapper. - **`spark_catalog` repointed to REST**: The SessionCatalog wrapper test is more valuable when it exercises the REST commit path than when it duplicates Hive coverage already provided by row 1. - **`formatVersion` collapsed to one v3 row**: v2 is covered twice (testhive, testrest); v3 covered once. v3 introduces DV semantics that `validateSnapshot` checks via `formatVersion >= 3`, so at least one v3 row is required. ## Known issues to follow up A local smoke run of `TestCopyOnWriteDelete` exposed 3 failures on the REST-backed rows that are pre-existing test-fixture assumptions, not row-level operation bugs: 1. **`testDeleteWithoutScanningTable`** (testrest + spark_catalog/REST, 2 failures). Throws `IllegalArgumentException: Missing scheme` from `TestBase.move()`, which calls `Path.of(URI.create(location))`. Hive's manifest paths come back with `file://` schemes; the REST catalog returns local paths without a scheme. 2. **`testDeleteFileThenMetadataDelete`** (testrest, 1 failure). Asserts a data file count after metadata-delete; the REST commit path produces a different file-set shape than Hive does. Both are infrastructure gaps surfaced by switching this test from testhive to testrest. They need targeted fixes (in `TestBase.move()` and the metadata-delete assertion) — out of scope for this PR. ## Test plan - [x] `spotlessCheck` + `compileTestJava` pass on all 3 Spark versions. - [x] Local smoke run of `TestCopyOnWriteDelete` against Spark 4.1 (132 invocations vs 264 originally; 50% reduction confirmed; pre-existing REST fixture failures noted above). - [ ] Full Spark CI run on this branch. - [ ] Verify no test method ends up with zero matching rows after the trim — quick `git grep "assumeThat"` showed no stacked predicates that would silence a test, but warrants a final pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
