stevenzwu commented on issue #16694: URL: https://github.com/apache/iceberg/issues/16694#issuecomment-4640623879
# V4 Adaptive Metadata Tree — Write Path (Full v3 Parity) ## Context Iceberg v4 [spec PR #16025](https://github.com/apache/iceberg/pull/16025) replaces the v3 manifest list with a single **root manifest** per snapshot. Root manifests use a unified `content_entry` schema that can reference leaf data/delete manifests (as `DATA_MANIFEST` / `DELETE_MANIFEST` entries) and may also hold direct data file entries. Leaf data manifests use the same `content_entry` schema, with optional co-located DV pointers in the `deletion_vector` field. v4 forbids `content_type=POSITION_DELETES` (the field id is reserved but never written) — DVs become a colocated property of the data file's content_entry, not a separate file class. Today, v4 tables already route through dedicated writers (`ManifestListWriter.V4Writer`, `ManifestWriter.V4Writer`, `ManifestWriter.V4DeleteWriter`), but those writers still emit **v3-shape output** (manifest list + v3-style `manifest_entry` rows, just tagged `format-version=4`). Per-version schema interfaces for `content_entry` already exist as standalone, unwired Java types: `TrackedFile`, `Tracking`, `ManifestInfo`, `DeletionVector` (each with a `*Struct` impl + builder). [PR #16677](https://github.com/apache/iceberg/pull/16677) decouples `DEFAULT_MAX_TEST_FORMAT_VERSION = 3` from `MAX_FORMAT_VERSION = 4` so we can churn v4 internals without disturbing the broad parameterized test suite. [PR #16100](https://github.com/apache/iceberg/pull/16100) adds `TrackedFileAdapters` — read-direction adapters that present `TrackedFile` entries (including colocated DVs) behind the legacy `DataFile` / `DeleteFile` interfaces, keeping `FileScanTask` / `ManifestGroup` / `DeleteFileIndex` larg ely unchanged. This plan implements the **full write path** for v4: root manifests + content_entry leaf manifests, colocated DV writes, all `SnapshotProducer` subclasses (Append, Overwrite, RowDelta, RewriteFiles, RewriteManifests, ReplacePartitions), engine integration in Spark and Flink, and full re-enablement of the parameterized test suite at v4. The goal is parity with the v3 commit surface, not just a minimal substrate. ## Approach Ten phases. Each phase produces an independently reviewable PR (or small PR series). Phases 1–5 establish the on-disk format and unwire SnapshotProducer's manifest-list dependency. Phase 6 rewrites the row-delta path for colocated DVs. Phases 7–9 extend that to read planning, the remaining SnapshotProducer subclasses, and engine writers. Phase 10 re-enables v4 in the broad parameterized test suite. **Prereqs**: - [PR #16677](https://github.com/apache/iceberg/pull/16677) lands first — without it, every phase will have collateral test failures unrelated to the v4 work. - [PR #16100](https://github.com/apache/iceberg/pull/16100) lands by Phase 2 — `ContentEntryReader` builds on `TrackedFileAdapters` rather than reinventing the TrackedFile→DataFile/DeleteFile projection. - [PR #16689](https://github.com/apache/iceberg/pull/16689) (`EntryStatus.MODIFIED` + `TrackingBuilder` status derivation) lands first — Phases 2 and 6 depend on `MODIFIED` and the `REPLACED`/`MODIFIED` pairing semantics. **Critical decisions baked in**: - `BaseSnapshot` stores `formatVersion` so `cacheManifests()` can dispatch between `ManifestLists.read()` and `RootManifests.read()` deterministically. - v4 leaf manifests written by Phase 2 are 100% content_entry — the existing `format-version=4` Parquet header becomes a clean read-dispatch signal (no extra header). - DVs are projected back to legacy `DeleteFile(POSITION_DELETES)` via `TrackedFileAdapters` on the read side; downstream scan planning sees the v3-shape layout it already understands. - All five `SnapshotProducer` subclasses inherit a single v4 commit path through `MergingSnapshotProducer`. Subclass-specific work in Phase 8 is small: most subclasses just work once the leaf writer, root-manifest path, and DV pairing are in place. --- ### Phase 1: Add `writer_format_version` field + adapter wrappers (no behavior change) Add the missing required field to `TrackedFile` (per spec PR #16025: field id **157**, `writer_format_version`, int, required for both write and read; v4 writers must produce value `1`; entries that reference legacy pre-v4 leaf manifests use value `0`): - `core/.../TrackedFile.java`: add `WRITER_FORMAT_VERSION = Types.NestedField.required(157, "writer_format_version", Types.IntegerType.get(), ...)` and include it in `schemaWithContentStats(...)`. Add `writerFormatVersion()` accessor. - `core/.../TrackedFileStruct.java`: add the field to the positional `StructLike` impl + builder. - Update `core/src/test/java/org/apache/iceberg/TestTrackedFile.java` and `TestTrackedFileStruct.java` to cover the new field. Then build the write-direction wrappers around the now-complete schema: - `core/.../ContentEntryWrapper.java` (new) — `StructLike` writer wrapper analogous to `V4Metadata.ManifestEntryWrapper` (`core/.../V4Metadata.java:316`), but mapping the `content_entry` field layout (TRACKING + CONTENT_TYPE + WRITER_FORMAT_VERSION + LOCATION + FILE_FORMAT + ... + DELETION_VECTOR + MANIFEST_INFO + ...) for positional access during Parquet write. - `core/.../ContentEntryAdapter.java` (new) — converts `ContentFile<?>` + `ManifestEntry` status/seq/snapshot fields + optional DV reference into a `TrackedFileStruct` with `writer_format_version=1`. Uses existing `TrackingBuilder` for the tracking nested struct, and `StatsUtil.contentStatsFor(schema)` for content_stats. For root-manifest entries that reference legacy v3 leaf manifests (the upgrade case in Phase 5), the adapter accepts an explicit override to set `writer_format_version=0`. Also provides a method to populate the `deletion_vector` struct when a position-delete `DeleteFile` is being collapsed into a colocated DV (used by Phase 6). Reuse `TrackedFile.schemaWithContentStats(partitionType, contentStatsType)` as the canonical content_entry schema — no new schema helper class. Reverse-direction conversion is provided by `TrackedFileAdapters` (PR #16100); do not duplicate. Reused utilities (do not duplicate): - `TrackedFile.schemaWithContentStats(...)` — canonical schema builder. - `TrackingBuilder.added(snapshotId)` / `from(source, snapshotId)` / `replaced(...)` / `modified(...)` — status derivation already accounts for REPLACED/MODIFIED. - `TrackedFileStruct`, `TrackingStruct`, `ManifestInfoStruct`, `DeletionVectorStruct` — concrete impls already exist with builders. - `StatsUtil.contentStatsFor(schema)` — schema → content_stats nested type. - `MetricsUtil.fromMetrics(schema, metrics)` — `Metrics` → `ContentStats`. - `PartitionSpec.rawPartitionType()` — partition struct type. **Files**: 2 modified (`TrackedFile`, `TrackedFileStruct`) + 2 test updates + 2 new (`ContentEntryWrapper`, `ContentEntryAdapter`). **No public API changes; no callers wired yet** — these are validated by Phase 2's round-trip tests. --- ### Phase 2: Replace v4 leaf manifest writers + ship matching reader Both write and read for content_entry leaf manifests must land together — `TestManifestWriterVersions.testV4Write*` is round-trip and breaks the moment the writer changes shape without a matching reader. - Rewrite `ManifestWriter.V4Writer` (`core/.../ManifestWriter.java:271-310`) and `ManifestWriter.V4DeleteWriter` (`core/.../ManifestWriter.java:312-355`) to emit `content_entry` rows via `ContentEntryWrapper`, using `TrackedFile.schemaWithContentStats(spec.rawPartitionType(), StatsUtil.contentStatsFor(schema).type().asStructType())`. All emitted rows carry `writer_format_version=1`. Output stays Parquet (already the v4 default in `SnapshotProducer.newManifestOutputFile()` at `core/.../SnapshotProducer.java:148-151`). - Both writers continue to emit the existing Parquet metadata key `format-version=4` for compatibility/auditing, but **read-dispatch does not depend on it**. See dispatch design below. - Fix `ManifestWriter.addEntry()`'s status counters to handle `REPLACED` and `MODIFIED` correctly — both are needed by Phase 6's DV updates. `REPLACED` increments `replacedFiles`/`replacedRows`; `MODIFIED` increments live counters. `toManifestFile()` populates the corresponding `ManifestInfo` slots. - New `core/.../ContentEntryReader.java` — reads `content_entry` Parquet rows into `TrackedFileStruct`, then uses `TrackedFileAdapters` (PR #16100) to project each entry to the appropriate legacy view: `asDataFile(...)` for `content_type=DATA` without a DV, `asDataFile(...)` + `asDVDeleteFile(...)` for `content_type=DATA` with a colocated DV, `asEqualityDeleteFile(...)` for `content_type=EQUALITY_DELETES`. Yields `ManifestEntry<DataFile>` / `ManifestEntry<DeleteFile>` so all downstream consumers (`ManifestGroup`, `DeleteFileIndex`, `MergingSnapshotProducer`'s manifest-rewrite paths) work unchanged. **Read-dispatch design (layered, no file-level guessing in production paths)** Dispatch is fully determined by where in the manifest tree you are. No production caller inspects file metadata to decide which reader to use. 1. **Snapshot → manifest tree.** Determined by which JSON key the snapshot has: `manifest-list` (v1–v3) or `root-manifest` (v4). Set in Phase 4. `BaseSnapshot.cacheManifests()` branches on `formatVersion`. 2. **Root manifest read.** A snapshot with `root-manifest` always points to a content_entry-shape root. `RootManifestReader` is used unconditionally — no per-file dispatch. 3. **Root → leaf manifest read.** Each manifest-reference content_entry in the root carries `writer_format_version`: `1` (v4 leaf, content_entry schema) or `0` (legacy v3 leaf carried over during a v3→v4 upgrade). `RootManifestReader` calls `ManifestFiles.read(manifest, io, specsById, writerFormatVersion)` which routes to `ContentEntryReader` or legacy `ManifestReader` accordingly. 4. **v3 manifest-list paths.** Snapshot has `manifest-list` → entries point to legacy v3 manifests → legacy `ManifestReader` unconditionally. No dispatch needed because v3 manifest lists never reference v4-shape leaves. This means `ManifestFiles.read(...)` (`core/.../ManifestFiles.java:152`) and `readDeleteManifest(...)` (line 320) gain an optional `writerFormatVersion` parameter. All snapshot-rooted callers supply it: `BaseSnapshot.cacheManifests`, `RootManifestReader`, `BaseRewriteManifests`, `RemoveSnapshots` / `IncrementalFileCleanup` / `ReachableFileUtil` walks. None of these paths inspect file metadata. **Fallback (non-production paths only).** When a caller opens a manifest file outside the snapshot tree (tests writing-then-reading without committing, ad-hoc tooling), it can omit the hint. The reader then peeks at the Parquet footer schema and dispatches on the presence of field id 134 (`content_type`) or 147 (`tracking`) — the schema is more self-describing than a metadata key. Single footer read, cached on the reader. **Why not dispatch on `writer_format_version` (field 157) at the file level?** It's a per-row field. To inspect it you'd have to commit to a schema first, which is circular. The hint-from-caller path uses the same value but supplies it externally; the fallback uses schema shape (where the field lives) without parsing rows. - Update existing v4 tests in `core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java` to assert `content_entry` shape on disk (including `writer_format_version=1`) and that the round-trip preserves all data file fields. Add a test that explicitly exercises the schema-shape fallback path (read without hint). **Files**: ~5 modified, 1 new. **Tests**: existing `TestManifestWriterVersions` v4 cases expanded; new `TestContentEntryReader` round-trip. --- ### Phase 3: RootManifestWriter and RootManifestReader - `core/.../RootManifests.java` (new) — factory mirroring `ManifestLists.java`, with `write(formatVersion, outputFile, ...)` and `read(inputFile)` entry points. - `core/.../RootManifestWriter.java` (new) — accepts `ManifestFile` objects (representing manifest references), emits `content_entry` rows with `content_type=DATA_MANIFEST` (3) or `DELETE_MANIFEST` (4) and `manifest_info` populated from each `ManifestFile`'s counts and min sequence number. Uses `ManifestInfoStruct.Builder`. Sets `writer_format_version=1` for entries referencing v4 leaf manifests; sets `writer_format_version=0` for entries referencing legacy v3 leaf manifests carried over during a v3→v4 upgrade. Also accepts direct data-file entries (`content_type=DATA` with optional DV) for callers that want to skip a leaf manifest. Emits `format-version=4` Parquet header. Output extension `.parquet`. - `core/.../RootManifestReader.java` (new) — reads `content_entry` rows: for `content_type ∈ {3, 4}` reconstructs `GenericManifestFile` objects (so downstream `ManifestGroup` / `DeleteFileIndex` work unchanged); for `content_type ∈ {0, 2}` projects via `TrackedFileAdapters` and synthesizes a single-entry `GenericManifestFile` so the data file flows through the same downstream pipeline. The synthesis keeps `ManifestGroup`'s API contract intact while supporting the small-write optimization on read. - New `core/src/test/java/org/apache/iceberg/TestRootManifest.java` — round-trip tests: write a small set of manifest references and direct data-file entries, read back, verify counts and identity. **Files**: 3 new + 1 test file. **No callers yet** — wired in Phase 5. --- ### Phase 4: Snapshot plumbing for `root-manifest` - `api/src/main/java/org/apache/iceberg/Snapshot.java`: add `default String rootManifestLocation() { return null; }`. Default-null (not a new sub-interface) matches the precedent of `schemaId()` / `firstRowId()` / `addedRows()` / `keyId()`. - `core/src/main/java/org/apache/iceberg/BaseSnapshot.java`: add `rootManifestLocation` field + new constructor overload. **Also add a `formatVersion` field** so `cacheManifests()` (`core/.../BaseSnapshot.java:186` region — the unconditional `ManifestLists.read(...)` call) can dispatch deterministically. v1–v3 snapshots store `manifestListLocation`; v4 snapshots store `rootManifestLocation`. Constructor enforces exactly one is non-null. - `core/src/main/java/org/apache/iceberg/SnapshotParser.java`: add `ROOT_MANIFEST = "root-manifest"` constant. In `toJson`, write `root-manifest` when v4+ and `manifest-list` otherwise. In `fromJson`, accept either key; `formatVersion` is passed in via the table metadata, which is already available at parse time. - `cacheManifests()` in `BaseSnapshot`: branch on `formatVersion >= 4` → `RootManifests.read(io.newInputFile(rootManifestLocation))`; else existing `ManifestLists.read(...)`. Both produce `List<ManifestFile>` so `ManifestGroup` is untouched. **Files**: 3 modified (1 API, 2 core). **Tests**: `TestSnapshotJson` adds v4 root-manifest round-trip; new `TestBaseSnapshotV4` covers the cacheManifests branch. --- ### Phase 5: SnapshotProducer integration - `core/.../SnapshotProducer.java`: - `manifestListPath()` (line 615) stays for v1–v3. - Add `rootManifestPath()` that produces `<metadata>/snap-<id>-<attempt>-<UUID>.parquet`. - In `apply()` (line 297), branch on `base.formatVersion() >= 4`: - v4 path: open `RootManifests.write(4, rootManifestPath(), ...)`, feed it the `manifests` list from `apply(base, parentSnapshot)`. Each `ManifestFile` becomes a `DATA_MANIFEST` / `DELETE_MANIFEST` content_entry. Build `BaseSnapshot` with `rootManifestLocation = rootManifestPath().location()` and `manifestListLocation = null`. - v3 path: existing behavior unchanged. - v3→v4 upgrade migration (first commit on a v3 table that's been upgraded to v4): the previous snapshot's manifests are read via `ManifestLists.read(parent.manifestListLocation())` and propagated through `apply(base, parentSnapshot)` as `EXISTING` manifest references with `writer_format_version=0`. No special-case branch needed. - For appends, file removals (copy-on-write leaf rewrite), and overwrites the existing `MergingSnapshotProducer` flow works untouched: append paths add new leaf manifests via `newManifestWriter()` (already polymorphic on format version, fixed in Phase 2); file removal rewrites the affected leaf manifest in copy-on-write style; new leaf becomes `ADDED`, prior leaf reference is omitted. **Files**: 1 modified. **Tests**: new `core/src/test/java/org/apache/iceberg/TestV4SnapshotProducer.java` covers append, file-removal delete, and overwrite end-to-end. --- ### Phase 6: Colocated DV writes in `MergingSnapshotProducer` Largest behavioral change. v4 spec forbids `content_type=POSITION_DELETES`; DVs live in the data file's `content_entry.deletion_vector` struct. When a snapshot adds or changes a DV on a data file already in a leaf manifest, that leaf is rewritten in copy-on-write style: the prior entry becomes `REPLACED`, the new entry (same data file location, populated `deletion_vector` struct) becomes `MODIFIED`. Both share the same row identity but different status/dv state. - `core/.../MergingSnapshotProducer.java`: - On v4, position-delete `DeleteFile` inputs (collected today in `dvsByReferencedFile`, lines 69-100 region) are not emitted as separate delete-manifest entries. Instead, for each affected data file: locate the data file's current leaf manifest, mark the existing data-file content_entry as `REPLACED` and add a new content_entry (same location, same partition, same stats) with `MODIFIED` status and `deletion_vector` populated from the input DV's Puffin reference (`ContentEntryAdapter.withDeletionVector(...)` from Phase 1). - When the input is a fresh data file with a DV from the same snapshot (i.e., the new file is born with a DV), a single `ADDED` content_entry with populated `deletion_vector` is emitted — no REPLACED/MODIFIED pair. - `validateDeleteFileForVersion()` (already rejects `POSITION_DELETES` files at v4) becomes a routing decision rather than an error: callers may still hand position-delete `DeleteFile` objects to v4 RowDelta; the producer collapses them into colocated DVs instead of failing. - `RowDelta.validateAddedDataFiles` / `validateAddedDeleteFiles`: extend the v4 branch to detect write-write conflicts on DV state — if a concurrent snapshot also produced REPLACED/MODIFIED for the same data-file row, fail with `ValidationException`. - `core/.../ManifestFilterManager.java`: when computing the carried-over manifests, recognize that a leaf manifest with REPLACED/MODIFIED rows produced by this commit has been rewritten — emit only the new manifest reference at root level. - Update the SnapshotSummary counters to track `REPLACED` rows and DV cardinality changes. **Files**: ~4 modified, 0 new. **Tests**: new `TestV4ColocatedDV` covers (a) add DV to existing data file, (b) replace existing DV, (c) data file born with DV, (d) concurrent DV updates conflict. Existing `TestRowDelta` cases parameterized for v4 in Phase 10. **Risk**: high. This rewrites the heart of the row-delta path, and the REPLACED/MODIFIED pairing introduces new commit-conflict semantics. The single biggest reviewer attention area. --- ### Phase 7: Read-path DV resolution wiring `TrackedFileAdapters` (PR #16100) already projects colocated DVs back to legacy `DeleteFile(POSITION_DELETES)` shape on the read side, so most of the scan planning chain works without changes. This phase is the small set of glue updates needed to route those synthesized delete files correctly. - `core/.../DeleteFileIndex.java`: when `ContentEntryReader` yields synthesized DV `DeleteFile` objects from a data manifest (rather than a separate delete manifest), `DeleteFileIndex.builderFor(...)` must group them by `referencedDataFile()` exactly as it does for v3 DVs. Verify that the existing grouping logic handles DVs that arrive interleaved with their target data files in the same manifest read. - `core/.../ManifestGroup.java`: confirm that filtering on a v4 data manifest's content_entry rows correctly applies to the projected DataFile (not the synthesized DV DeleteFile). The simplest invariant: the partition/stats filter is evaluated on the DataFile projection only; the DV is attached after the data file passes filtering. - `core/.../ScanPlanningMode.java` (if applicable): no changes — single-pass planning is unchanged because data and delete content arrive together in the data manifest. - `core/.../BaseTableScan.java`: smoke check that `planFiles()` produces the same `FileScanTask` shape for v3 and v4 tables under the same workload. **Files**: ~2 modified (mostly verification, small adjustments). **Tests**: parameterize representative scan-planning tests at v4 in this phase (selected subset, full re-enablement in Phase 10). --- ### Phase 8: All `SnapshotProducer` subclasses through v4 Most subclasses inherit `MergingSnapshotProducer` and come along for free once Phases 5–6 land. This phase is the targeted work to make the rest pass v4 parameterization. - `MergeAppend`, `FastAppend`, `BaseOverwriteFiles`, `BaseRowDelta`, `BaseReplacePartitions`: work transparently because they call `newManifestWriter()` / `newDeleteManifestWriter()` (polymorphic on format version, fixed in Phase 2) and feed `apply(base, parentSnapshot)`'s `List<ManifestFile>` to `SnapshotProducer.apply()` (branches on v4 in Phase 5). - `core/.../BaseRewriteFiles.java`: rewrite-assertion logic compares before/after manifests against `EntryStatus.ADDED`/`EXISTING`. Extend to admit `REPLACED`/`MODIFIED` for v4 (where rewriting a data file with a DV update is a valid intermediate state). - `core/.../BaseRewriteManifests.java`: operates at the manifest level and produces new manifest references. Ensure the rewritten manifests carry correct `writer_format_version` and `manifest_info` when re-emitted into a root manifest. Likely a small targeted change in the result-construction code path. - `core/.../RemoveSnapshots.java`, `core/.../IncrementalFileCleanup.java`, `core/.../ReachableFileUtil.java`: read-only walks over the manifest tree must follow root manifests for v4 snapshots and manifest lists for v1–v3. `BaseSnapshot.cacheManifests()` (Phase 4) handles this transparently for most callers; check direct readers of `manifestListLocation()` and update them to use the v4-aware `dataManifests(io)` / `deleteManifests(io)` accessors instead. **Files**: ~5 modified. **Tests**: parameterize `TestRewriteFiles`, `TestRewriteManifests`, `TestRemoveSnapshots`, `TestReplacePartitions` at v4 (selected representative cases in this phase, full coverage in Phase 10). --- ### Phase 9: Engine integration (Spark, Flink) Engine row-delta paths today produce `DeleteFile` objects with `FileContent.POSITION_DELETES` — Phase 6 now collapses those into colocated DVs at the catalog layer. So in principle no engine change is required for correctness. But two things still need attention: - **Spark**: - `spark/.../SparkPositionDeltaWriter.java`, `spark/.../SparkFileWriterFactory.java`: confirm the position-delete writer still emits a Puffin DV blob (existing v3 behavior) and that the resulting `DeleteFile` flows through `RowDelta.addRows()/addDeletes()` cleanly. The catalog layer collapses it; the engine doesn't need to know. - `spark/.../SparkWriteConf.java`, `spark/.../SparkWriteRequirements.java`: add a v4-aware branch that surfaces "use DVs" as the default for v4 (matches current v3 default when `write.delete.mode=merge-on-read` with DVs). - `spark/.../actions/RewriteDataFilesSparkAction.java`, `RewritePositionDeleteFilesSparkAction.java`: verify v4 semantics — rewrite-position-delete files becomes "compact DVs" (since v4 only has DVs, not position-delete files). The action may simplify or no-op on v4. - **Flink**: - `flink/.../sink/IcebergFilesCommitter.java`, `flink/.../sink/RowDataTaskWriterFactory.java`: same shape as Spark — ensure existing position-delete output flows through to the v4 collapse path in `MergingSnapshotProducer`. - `flink/.../source/IcebergSource.java`: read-side gets DVs via `TrackedFileAdapters` projection — no change needed. **Files**: ~3-5 modified Spark, ~2-3 modified Flink. **Risk**: medium. The engine plumbing looks correct on inspection, but the position-delete writer paths have many integration tests that will exercise the v4 collapse end-to-end in Phase 10. --- ### Phase 10: Re-enable v4 in broad parameterized tests Now this is a deliverable, not a triage exercise. - Bump `DEFAULT_MAX_TEST_FORMAT_VERSION` to 4 in `api/src/test/java/org/apache/iceberg/TestHelpers.java` (constant from #16677). This re-enables `TestBase`-derived suites at v4: `TestMergeAppend`, `TestOverwrite`, `TestRewriteFiles`, `TestReplacePartitions`, `TestRowDelta`, `TestRemoveSnapshots`, plus engine equivalents in `data/`, `spark/`, and `flink/`. - Triage and fix: - Tests that assert v4 manifests have a manifest-list path → update to assert `rootManifestLocation`. - Tests that read manifest list bytes directly → switch to `RootManifestReader`. - Tests asserting `DeleteFile(POSITION_DELETES)` shape on disk → update to assert colocated DV in the data manifest. - Spec-conformance tests for the on-disk format: roaring bitmap encoding for manifest DVs, exact field IDs, optional/required field handling, REPLACED+MODIFIED pair invariants. - Migration tests: v3→v4 upgrade with snapshots that have position-delete files. These must read correctly under the v4 read path (handled by the v4-aware `cacheManifests()` from Phase 4 plus the `writer_format_version=0` content references from Phase 5) even though they can never be re-emitted as v4 writes. - Each batch of test fixes is a small follow-up PR within Phase 10. **Files**: 1 modified + dozens of test-only adjustments. **Risk**: medium — test churn dominates, but no new architectural risk. --- ## Out of scope (explicit follow-ups) 1. **Direct data-file entries in the root manifest as a default behavior** (the small-write optimization). Phase 3's writer accepts them; no commit path emits them by default. A follow-up adds a heuristic in `SnapshotProducer.apply()` to skip leaf-manifest creation for sufficiently small writes. 2. **Manifest delete vectors at the root level.** v4 spec allows marking a leaf manifest's row as deleted via a manifest DV stored in the root, avoiding leaf-manifest rewrite. All deletes in this plan still go through copy-on-write leaf rewrites. Adding root-level manifest DVs is a write-amplification optimization for high-churn tables. 3. **Manifest compaction during commit.** v4 spec allows compacting existing manifests as part of any snapshot. Currently a separate `RewriteManifests` action. 4. **Optional affinity between data and delete manifests.** v4 spec allows a delete manifest to be affiliated with a specific data manifest for single-pass planning. Not implemented in this plan. 5. **Removing partition tuples / partition field summaries.** Still an open spec question — this plan materializes partition tuples as in v3. The `TrackedFileAdapters.partition()` TODO ([#16222](https://github.com/apache/iceberg/issues/16222)) tracks the cleanup. 6. **Union-of-all-historical-partition-specs schema for v4 partition types.** Per spec PR text, "the partition tuple schema for data files must be the union schema of all historical partition specs." Initial implementation uses the current spec's partition type only. ## Critical files to modify | Phase | File | |---|---| | 1 | `core/.../TrackedFile.java` — add `WRITER_FORMAT_VERSION` (id 157) | | 1 | `core/.../TrackedFileStruct.java` — add field + builder support | | 2 | `core/.../ManifestWriter.java` — V4Writer / V4DeleteWriter rewrite, REPLACED/MODIFIED counter fixes | | 2 | `core/.../ManifestFiles.java` — read dispatch by Parquet header | | 4 | `api/.../Snapshot.java` — `rootManifestLocation()` default method | | 4 | `core/.../BaseSnapshot.java` — formatVersion field, cacheManifests branch | | 4 | `core/.../SnapshotParser.java` — `root-manifest` JSON key | | 5 | `core/.../SnapshotProducer.java` — apply() v4 branch, rootManifestPath() | | 6 | `core/.../MergingSnapshotProducer.java` — colocated DV emission, REPLACED/MODIFIED pairing | | 6 | `core/.../ManifestFilterManager.java` — recognize rewritten leaves | | 7 | `core/.../DeleteFileIndex.java` — DV grouping from data manifests | | 8 | `core/.../BaseRewriteFiles.java` — admit REPLACED/MODIFIED | | 8 | `core/.../BaseRewriteManifests.java` — emit correct content_entry references | | 9 | `spark/.../SparkWriteConf.java`, `SparkWriteRequirements.java` — v4 defaults | | 9 | `flink/.../sink/RowDataTaskWriterFactory.java` — v4 routing verification | | 10 | `api/src/test/java/.../TestHelpers.java` — `DEFAULT_MAX_TEST_FORMAT_VERSION = 4` | ## Critical files to add | Phase | File | |---|---| | 1 | `core/.../ContentEntryWrapper.java` | | 1 | `core/.../ContentEntryAdapter.java` | | 2 | `core/.../ContentEntryReader.java` | | 3 | `core/.../RootManifests.java` | | 3 | `core/.../RootManifestWriter.java` | | 3 | `core/.../RootManifestReader.java` | | 3 | `core/src/test/java/.../TestRootManifest.java` | | 5 | `core/src/test/java/.../TestV4SnapshotProducer.java` | | 6 | `core/src/test/java/.../TestV4ColocatedDV.java` | ## Reused existing utilities (do not duplicate) - `TrackedFile.schemaWithContentStats(...)` — `core/.../TrackedFile.java:95` - `TrackedFileAdapters.asDataFile / asDVDeleteFile / asEqualityDeleteFile` — PR #16100 (read-direction projection) - `TrackingBuilder` — `core/.../TrackingBuilder.java` (status derivation including REPLACED/MODIFIED from PR #16689) - `TrackedFileStruct`, `TrackingStruct`, `ManifestInfoStruct`, `DeletionVectorStruct` — concrete `StructLike` impls with builders - `StatsUtil.contentStatsFor(schema)` — schema → content_stats type - `MetricsUtil.fromMetrics(schema, metrics)` — `Metrics` → `ContentStats` - `MetricsUtil.valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds(ContentStats)` — PR #16100 (reverse projection helpers) - `InternalData.write(FileFormat.PARQUET, file)` — same path the existing V4Writer uses - `EntryStatus` — including `MODIFIED` from PR #16689 - `GenericManifestFile` — for reconstructing `ManifestFile` objects from root manifest reads ## Top risks 1. **Phase 6 REPLACED/MODIFIED pairing semantics.** Introducing a new commit pattern (one row in a leaf manifest produces two rows in the rewritten leaf — one REPLACED, one MODIFIED) is the largest design risk. Edge cases: concurrent DV updates on the same data file (must conflict-detect), DV update plus file deletion in the same snapshot (which status wins?), DV update on a file that was just added in the same commit (collapse to single ADDED with DV). 2. **Read-path consistency for v3→v4 upgraded tables.** Phase 4 makes `BaseSnapshot.cacheManifests()` format-version-aware, but `SnapshotParser.fromJson` needs the table's `formatVersion` at parse time. Where snapshots are parsed without table context (e.g., in some catalog code paths), the `formatVersion` source must be threaded through. Missing this leads to runtime classloading-style failures on read. 3. **Engine test flakiness on v4 (Phase 9-10).** Spark and Flink integration tests have many DV-related assertions hardcoded to v3 file shapes. Re-enabling v4 in their parameterized suites will surface a long tail of small fixes. Budget time accordingly. 4. **`ManifestFiles.read()` format detection.** Dispatch is layered: snapshot JSON key chooses `manifest-list` vs `root-manifest` paths (Phase 4); the root manifest itself is always content_entry; leaf reads from a root use the entry's `writer_format_version` as a hint passed by the caller. No production caller inspects file metadata. The schema-shape fallback (presence of field 134 or 147 in the Parquet footer schema) is reserved for tests/ad-hoc tooling. Risk reduces to making sure all snapshot-rooted callers (`BaseSnapshot.cacheManifests`, `RootManifestReader`, `BaseRewriteManifests`, `RemoveSnapshots` walks) thread the hint through. 5. **Migration of historical position-delete files.** v3→v4 upgraded tables may have pre-existing `DeleteFile(POSITION_DELETES)` entries in carried-over delete manifests. The v4 read path must handle these correctly through `TrackedFileAdapters` (reading the v3 entries as legacy `DeleteFile` objects, not as colocated DVs). v4 writers must never re-emit them as v4-shape position-delete entries. Test explicitly with a v3→v4 upgrade migration test. ## Verification Build & test (per phase): - `./gradlew :iceberg-core:spotlessApply :iceberg-core:compileJava :iceberg-core:compileTestJava` - Phase 2: `./gradlew :iceberg-core:test --tests TestManifestWriterVersions --tests TestContentEntryReader` - Phase 3: `./gradlew :iceberg-core:test --tests TestRootManifest` - Phase 4: `./gradlew :iceberg-core:test --tests TestSnapshotJson --tests TestBaseSnapshotV4` - Phase 5: `./gradlew :iceberg-core:test --tests TestV4SnapshotProducer` - Phase 6: `./gradlew :iceberg-core:test --tests TestV4ColocatedDV --tests TestRowDelta` - Phase 7: `./gradlew :iceberg-core:test --tests TestDeleteFileIndex --tests TestManifestGroup` - Phase 8: `./gradlew :iceberg-core:test --tests TestRewriteFiles --tests TestRewriteManifests --tests TestRemoveSnapshots --tests TestReplacePartitions` - Phase 9: `./gradlew :iceberg-spark-3.5_2.12:test :iceberg-flink-1.20:test` (representative subset) - Phase 10: full `:iceberg-core:test :iceberg-api:test :iceberg-data:test :iceberg-spark-3.5_2.12:test :iceberg-flink-1.20:test` End-to-end sanity for Phase 5 (write path basics): 1. Create an unpartitioned v4 table. 2. AppendFiles → snapshot has `rootManifestLocation` set, `manifestListLocation` null; root manifest is `.parquet` with `format-version=4` header; one `DATA_MANIFEST` content_entry (with `writer_format_version=1`) references one new leaf data manifest, which itself carries `format-version=4` and content_entry rows with `writer_format_version=1`. 3. AppendFiles again → root has two `DATA_MANIFEST` entries, first marked `EXISTING`, second `ADDED`. 4. DeleteFiles removing one file from the first leaf → root has one rewritten leaf reference (`ADDED`) replacing the prior, plus the second leaf still `EXISTING`; the rewritten leaf has the removed file's entry marked `DELETED`. End-to-end sanity for Phase 6 (DV semantics): 5. Append data file D1 → leaf manifest M1 with one `ADDED` entry, `deletion_vector=null`. 6. RowDelta adding a DV for D1 → new leaf manifest M2 with two entries: D1 `REPLACED` (no DV) and D1 `MODIFIED` (with DV pointing into a Puffin file). Root manifest references M2 as `ADDED`, M1 omitted. 7. Concurrent RowDelta also adding a DV for D1 → second commit fails with `ValidationException`. 8. Append a brand-new data file D2 with a DV born in the same commit → leaf has one `ADDED` entry with `deletion_vector` populated. 9. Re-read all snapshots end-to-end via `Table.newScan().planFiles()`; verify file lists and DV applications match expected. End-to-end sanity for Phase 10 (full parity): 10. v3→v4 upgrade: create v3 table with appends + position-delete files, upgrade to v4, perform RowDelta + RewriteFiles + RemoveSnapshots; verify all operations succeed and produce spec-conformant v4 metadata while still reading historical v3 entries correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
