Re: [I] v4: add manifest write support [iceberg]

via GitHub Sat, 06 Jun 2026 15:23:54 -0700


stevenzwu commented on issue #16694:
URL: https://github.com/apache/iceberg/issues/16694#issuecomment-4640623879


   # V4 Adaptive Metadata Tree — Write Path (Full v3 Parity)
   
   ## Context
   
   Iceberg v4 [spec PR #16025](https://github.com/apache/iceberg/pull/16025) 
replaces the v3 manifest list with a single **root manifest** per snapshot. 
Root manifests use a unified `content_entry` schema that can reference leaf 
data/delete manifests (as `DATA_MANIFEST` / `DELETE_MANIFEST` entries) and may 
also hold direct data file entries. Leaf data manifests use the same 
`content_entry` schema, with optional co-located DV pointers in the 
`deletion_vector` field. v4 forbids `content_type=POSITION_DELETES` (the field 
id is reserved but never written) — DVs become a colocated property of the data 
file's content_entry, not a separate file class.
   
   Today, v4 tables already route through dedicated writers 
(`ManifestListWriter.V4Writer`, `ManifestWriter.V4Writer`, 
`ManifestWriter.V4DeleteWriter`), but those writers still emit **v3-shape 
output** (manifest list + v3-style `manifest_entry` rows, just tagged 
`format-version=4`). Per-version schema interfaces for `content_entry` already 
exist as standalone, unwired Java types: `TrackedFile`, `Tracking`, 
`ManifestInfo`, `DeletionVector` (each with a `*Struct` impl + builder). [PR 
#16677](https://github.com/apache/iceberg/pull/16677) decouples 
`DEFAULT_MAX_TEST_FORMAT_VERSION = 3` from `MAX_FORMAT_VERSION = 4` so we can 
churn v4 internals without disturbing the broad parameterized test suite. [PR 
#16100](https://github.com/apache/iceberg/pull/16100) adds 
`TrackedFileAdapters` — read-direction adapters that present `TrackedFile` 
entries (including colocated DVs) behind the legacy `DataFile` / `DeleteFile` 
interfaces, keeping `FileScanTask` / `ManifestGroup` / `DeleteFileIndex` larg
 ely unchanged.
   
   This plan implements the **full write path** for v4: root manifests + 
content_entry leaf manifests, colocated DV writes, all `SnapshotProducer` 
subclasses (Append, Overwrite, RowDelta, RewriteFiles, RewriteManifests, 
ReplacePartitions), engine integration in Spark and Flink, and full 
re-enablement of the parameterized test suite at v4. The goal is parity with 
the v3 commit surface, not just a minimal substrate.
   
   ## Approach
   
   Ten phases. Each phase produces an independently reviewable PR (or small PR 
series). Phases 1–5 establish the on-disk format and unwire SnapshotProducer's 
manifest-list dependency. Phase 6 rewrites the row-delta path for colocated 
DVs. Phases 7–9 extend that to read planning, the remaining SnapshotProducer 
subclasses, and engine writers. Phase 10 re-enables v4 in the broad 
parameterized test suite.
   
   **Prereqs**:
   - [PR #16677](https://github.com/apache/iceberg/pull/16677) lands first — 
without it, every phase will have collateral test failures unrelated to the v4 
work.
   - [PR #16100](https://github.com/apache/iceberg/pull/16100) lands by Phase 2 
— `ContentEntryReader` builds on `TrackedFileAdapters` rather than reinventing 
the TrackedFile→DataFile/DeleteFile projection.
   - [PR #16689](https://github.com/apache/iceberg/pull/16689) 
(`EntryStatus.MODIFIED` + `TrackingBuilder` status derivation) lands first — 
Phases 2 and 6 depend on `MODIFIED` and the `REPLACED`/`MODIFIED` pairing 
semantics.
   
   **Critical decisions baked in**:
   - `BaseSnapshot` stores `formatVersion` so `cacheManifests()` can dispatch 
between `ManifestLists.read()` and `RootManifests.read()` deterministically.
   - v4 leaf manifests written by Phase 2 are 100% content_entry — the existing 
`format-version=4` Parquet header becomes a clean read-dispatch signal (no 
extra header).
   - DVs are projected back to legacy `DeleteFile(POSITION_DELETES)` via 
`TrackedFileAdapters` on the read side; downstream scan planning sees the 
v3-shape layout it already understands.
   - All five `SnapshotProducer` subclasses inherit a single v4 commit path 
through `MergingSnapshotProducer`. Subclass-specific work in Phase 8 is small: 
most subclasses just work once the leaf writer, root-manifest path, and DV 
pairing are in place.
   
   ---
   
   ### Phase 1: Add `writer_format_version` field + adapter wrappers (no 
behavior change)
   
   Add the missing required field to `TrackedFile` (per spec PR #16025: field 
id **157**, `writer_format_version`, int, required for both write and read; v4 
writers must produce value `1`; entries that reference legacy pre-v4 leaf 
manifests use value `0`):
   
   - `core/.../TrackedFile.java`: add `WRITER_FORMAT_VERSION = 
Types.NestedField.required(157, "writer_format_version", 
Types.IntegerType.get(), ...)` and include it in `schemaWithContentStats(...)`. 
Add `writerFormatVersion()` accessor.
   - `core/.../TrackedFileStruct.java`: add the field to the positional 
`StructLike` impl + builder.
   - Update `core/src/test/java/org/apache/iceberg/TestTrackedFile.java` and 
`TestTrackedFileStruct.java` to cover the new field.
   
   Then build the write-direction wrappers around the now-complete schema:
   
   - `core/.../ContentEntryWrapper.java` (new) — `StructLike` writer wrapper 
analogous to `V4Metadata.ManifestEntryWrapper` 
(`core/.../V4Metadata.java:316`), but mapping the `content_entry` field layout 
(TRACKING + CONTENT_TYPE + WRITER_FORMAT_VERSION + LOCATION + FILE_FORMAT + ... 
+ DELETION_VECTOR + MANIFEST_INFO + ...) for positional access during Parquet 
write.
   - `core/.../ContentEntryAdapter.java` (new) — converts `ContentFile<?>` + 
`ManifestEntry` status/seq/snapshot fields + optional DV reference into a 
`TrackedFileStruct` with `writer_format_version=1`. Uses existing 
`TrackingBuilder` for the tracking nested struct, and 
`StatsUtil.contentStatsFor(schema)` for content_stats. For root-manifest 
entries that reference legacy v3 leaf manifests (the upgrade case in Phase 5), 
the adapter accepts an explicit override to set `writer_format_version=0`. Also 
provides a method to populate the `deletion_vector` struct when a 
position-delete `DeleteFile` is being collapsed into a colocated DV (used by 
Phase 6).
   
   Reuse `TrackedFile.schemaWithContentStats(partitionType, contentStatsType)` 
as the canonical content_entry schema — no new schema helper class. 
Reverse-direction conversion is provided by `TrackedFileAdapters` (PR #16100); 
do not duplicate.
   
   Reused utilities (do not duplicate):
   - `TrackedFile.schemaWithContentStats(...)` — canonical schema builder.
   - `TrackingBuilder.added(snapshotId)` / `from(source, snapshotId)` / 
`replaced(...)` / `modified(...)` — status derivation already accounts for 
REPLACED/MODIFIED.
   - `TrackedFileStruct`, `TrackingStruct`, `ManifestInfoStruct`, 
`DeletionVectorStruct` — concrete impls already exist with builders.
   - `StatsUtil.contentStatsFor(schema)` — schema → content_stats nested type.
   - `MetricsUtil.fromMetrics(schema, metrics)` — `Metrics` → `ContentStats`.
   - `PartitionSpec.rawPartitionType()` — partition struct type.
   
   **Files**: 2 modified (`TrackedFile`, `TrackedFileStruct`) + 2 test updates 
+ 2 new (`ContentEntryWrapper`, `ContentEntryAdapter`). **No public API 
changes; no callers wired yet** — these are validated by Phase 2's round-trip 
tests.
   
   ---
   
   ### Phase 2: Replace v4 leaf manifest writers + ship matching reader
   
   Both write and read for content_entry leaf manifests must land together — 
`TestManifestWriterVersions.testV4Write*` is round-trip and breaks the moment 
the writer changes shape without a matching reader.
   
   - Rewrite `ManifestWriter.V4Writer` (`core/.../ManifestWriter.java:271-310`) 
and `ManifestWriter.V4DeleteWriter` (`core/.../ManifestWriter.java:312-355`) to 
emit `content_entry` rows via `ContentEntryWrapper`, using 
`TrackedFile.schemaWithContentStats(spec.rawPartitionType(), 
StatsUtil.contentStatsFor(schema).type().asStructType())`. All emitted rows 
carry `writer_format_version=1`. Output stays Parquet (already the v4 default 
in `SnapshotProducer.newManifestOutputFile()` at 
`core/.../SnapshotProducer.java:148-151`).
   - Both writers continue to emit the existing Parquet metadata key 
`format-version=4` for compatibility/auditing, but **read-dispatch does not 
depend on it**. See dispatch design below.
   - Fix `ManifestWriter.addEntry()`'s status counters to handle `REPLACED` and 
`MODIFIED` correctly — both are needed by Phase 6's DV updates. `REPLACED` 
increments `replacedFiles`/`replacedRows`; `MODIFIED` increments live counters. 
`toManifestFile()` populates the corresponding `ManifestInfo` slots.
   - New `core/.../ContentEntryReader.java` — reads `content_entry` Parquet 
rows into `TrackedFileStruct`, then uses `TrackedFileAdapters` (PR #16100) to 
project each entry to the appropriate legacy view: `asDataFile(...)` for 
`content_type=DATA` without a DV, `asDataFile(...)` + `asDVDeleteFile(...)` for 
`content_type=DATA` with a colocated DV, `asEqualityDeleteFile(...)` for 
`content_type=EQUALITY_DELETES`. Yields `ManifestEntry<DataFile>` / 
`ManifestEntry<DeleteFile>` so all downstream consumers (`ManifestGroup`, 
`DeleteFileIndex`, `MergingSnapshotProducer`'s manifest-rewrite paths) work 
unchanged.
   
   **Read-dispatch design (layered, no file-level guessing in production 
paths)**
   
   Dispatch is fully determined by where in the manifest tree you are. No 
production caller inspects file metadata to decide which reader to use.
   
   1. **Snapshot → manifest tree.** Determined by which JSON key the snapshot 
has: `manifest-list` (v1–v3) or `root-manifest` (v4). Set in Phase 4. 
`BaseSnapshot.cacheManifests()` branches on `formatVersion`.
   2. **Root manifest read.** A snapshot with `root-manifest` always points to 
a content_entry-shape root. `RootManifestReader` is used unconditionally — no 
per-file dispatch.
   3. **Root → leaf manifest read.** Each manifest-reference content_entry in 
the root carries `writer_format_version`: `1` (v4 leaf, content_entry schema) 
or `0` (legacy v3 leaf carried over during a v3→v4 upgrade). 
`RootManifestReader` calls `ManifestFiles.read(manifest, io, specsById, 
writerFormatVersion)` which routes to `ContentEntryReader` or legacy 
`ManifestReader` accordingly.
   4. **v3 manifest-list paths.** Snapshot has `manifest-list` → entries point 
to legacy v3 manifests → legacy `ManifestReader` unconditionally. No dispatch 
needed because v3 manifest lists never reference v4-shape leaves.
   
   This means `ManifestFiles.read(...)` (`core/.../ManifestFiles.java:152`) and 
`readDeleteManifest(...)` (line 320) gain an optional `writerFormatVersion` 
parameter. All snapshot-rooted callers supply it: 
`BaseSnapshot.cacheManifests`, `RootManifestReader`, `BaseRewriteManifests`, 
`RemoveSnapshots` / `IncrementalFileCleanup` / `ReachableFileUtil` walks. None 
of these paths inspect file metadata.
   
   **Fallback (non-production paths only).** When a caller opens a manifest 
file outside the snapshot tree (tests writing-then-reading without committing, 
ad-hoc tooling), it can omit the hint. The reader then peeks at the Parquet 
footer schema and dispatches on the presence of field id 134 (`content_type`) 
or 147 (`tracking`) — the schema is more self-describing than a metadata key. 
Single footer read, cached on the reader.
   
   **Why not dispatch on `writer_format_version` (field 157) at the file 
level?** It's a per-row field. To inspect it you'd have to commit to a schema 
first, which is circular. The hint-from-caller path uses the same value but 
supplies it externally; the fallback uses schema shape (where the field lives) 
without parsing rows.
   
   - Update existing v4 tests in 
`core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java` to 
assert `content_entry` shape on disk (including `writer_format_version=1`) and 
that the round-trip preserves all data file fields. Add a test that explicitly 
exercises the schema-shape fallback path (read without hint).
   
   **Files**: ~5 modified, 1 new. **Tests**: existing 
`TestManifestWriterVersions` v4 cases expanded; new `TestContentEntryReader` 
round-trip.
   
   ---
   
   ### Phase 3: RootManifestWriter and RootManifestReader
   
   - `core/.../RootManifests.java` (new) — factory mirroring 
`ManifestLists.java`, with `write(formatVersion, outputFile, ...)` and 
`read(inputFile)` entry points.
   - `core/.../RootManifestWriter.java` (new) — accepts `ManifestFile` objects 
(representing manifest references), emits `content_entry` rows with 
`content_type=DATA_MANIFEST` (3) or `DELETE_MANIFEST` (4) and `manifest_info` 
populated from each `ManifestFile`'s counts and min sequence number. Uses 
`ManifestInfoStruct.Builder`. Sets `writer_format_version=1` for entries 
referencing v4 leaf manifests; sets `writer_format_version=0` for entries 
referencing legacy v3 leaf manifests carried over during a v3→v4 upgrade. Also 
accepts direct data-file entries (`content_type=DATA` with optional DV) for 
callers that want to skip a leaf manifest. Emits `format-version=4` Parquet 
header. Output extension `.parquet`.
   - `core/.../RootManifestReader.java` (new) — reads `content_entry` rows: for 
`content_type ∈ {3, 4}` reconstructs `GenericManifestFile` objects (so 
downstream `ManifestGroup` / `DeleteFileIndex` work unchanged); for 
`content_type ∈ {0, 2}` projects via `TrackedFileAdapters` and synthesizes a 
single-entry `GenericManifestFile` so the data file flows through the same 
downstream pipeline. The synthesis keeps `ManifestGroup`'s API contract intact 
while supporting the small-write optimization on read.
   - New `core/src/test/java/org/apache/iceberg/TestRootManifest.java` — 
round-trip tests: write a small set of manifest references and direct data-file 
entries, read back, verify counts and identity.
   
   **Files**: 3 new + 1 test file. **No callers yet** — wired in Phase 5.
   
   ---
   
   ### Phase 4: Snapshot plumbing for `root-manifest`
   
   - `api/src/main/java/org/apache/iceberg/Snapshot.java`: add `default String 
rootManifestLocation() { return null; }`. Default-null (not a new 
sub-interface) matches the precedent of `schemaId()` / `firstRowId()` / 
`addedRows()` / `keyId()`.
   - `core/src/main/java/org/apache/iceberg/BaseSnapshot.java`: add 
`rootManifestLocation` field + new constructor overload. **Also add a 
`formatVersion` field** so `cacheManifests()` (`core/.../BaseSnapshot.java:186` 
region — the unconditional `ManifestLists.read(...)` call) can dispatch 
deterministically. v1–v3 snapshots store `manifestListLocation`; v4 snapshots 
store `rootManifestLocation`. Constructor enforces exactly one is non-null.
   - `core/src/main/java/org/apache/iceberg/SnapshotParser.java`: add 
`ROOT_MANIFEST = "root-manifest"` constant. In `toJson`, write `root-manifest` 
when v4+ and `manifest-list` otherwise. In `fromJson`, accept either key; 
`formatVersion` is passed in via the table metadata, which is already available 
at parse time.
   - `cacheManifests()` in `BaseSnapshot`: branch on `formatVersion >= 4` → 
`RootManifests.read(io.newInputFile(rootManifestLocation))`; else existing 
`ManifestLists.read(...)`. Both produce `List<ManifestFile>` so `ManifestGroup` 
is untouched.
   
   **Files**: 3 modified (1 API, 2 core). **Tests**: `TestSnapshotJson` adds v4 
root-manifest round-trip; new `TestBaseSnapshotV4` covers the cacheManifests 
branch.
   
   ---
   
   ### Phase 5: SnapshotProducer integration
   
   - `core/.../SnapshotProducer.java`:
     - `manifestListPath()` (line 615) stays for v1–v3.
     - Add `rootManifestPath()` that produces 
`<metadata>/snap-<id>-<attempt>-<UUID>.parquet`.
     - In `apply()` (line 297), branch on `base.formatVersion() >= 4`:
       - v4 path: open `RootManifests.write(4, rootManifestPath(), ...)`, feed 
it the `manifests` list from `apply(base, parentSnapshot)`. Each `ManifestFile` 
becomes a `DATA_MANIFEST` / `DELETE_MANIFEST` content_entry. Build 
`BaseSnapshot` with `rootManifestLocation = rootManifestPath().location()` and 
`manifestListLocation = null`.
       - v3 path: existing behavior unchanged.
   - v3→v4 upgrade migration (first commit on a v3 table that's been upgraded 
to v4): the previous snapshot's manifests are read via 
`ManifestLists.read(parent.manifestListLocation())` and propagated through 
`apply(base, parentSnapshot)` as `EXISTING` manifest references with 
`writer_format_version=0`. No special-case branch needed.
   - For appends, file removals (copy-on-write leaf rewrite), and overwrites 
the existing `MergingSnapshotProducer` flow works untouched: append paths add 
new leaf manifests via `newManifestWriter()` (already polymorphic on format 
version, fixed in Phase 2); file removal rewrites the affected leaf manifest in 
copy-on-write style; new leaf becomes `ADDED`, prior leaf reference is omitted.
   
   **Files**: 1 modified. **Tests**: new 
`core/src/test/java/org/apache/iceberg/TestV4SnapshotProducer.java` covers 
append, file-removal delete, and overwrite end-to-end.
   
   ---
   
   ### Phase 6: Colocated DV writes in `MergingSnapshotProducer`
   
   Largest behavioral change. v4 spec forbids `content_type=POSITION_DELETES`; 
DVs live in the data file's `content_entry.deletion_vector` struct. When a 
snapshot adds or changes a DV on a data file already in a leaf manifest, that 
leaf is rewritten in copy-on-write style: the prior entry becomes `REPLACED`, 
the new entry (same data file location, populated `deletion_vector` struct) 
becomes `MODIFIED`. Both share the same row identity but different status/dv 
state.
   
   - `core/.../MergingSnapshotProducer.java`:
     - On v4, position-delete `DeleteFile` inputs (collected today in 
`dvsByReferencedFile`, lines 69-100 region) are not emitted as separate 
delete-manifest entries. Instead, for each affected data file: locate the data 
file's current leaf manifest, mark the existing data-file content_entry as 
`REPLACED` and add a new content_entry (same location, same partition, same 
stats) with `MODIFIED` status and `deletion_vector` populated from the input 
DV's Puffin reference (`ContentEntryAdapter.withDeletionVector(...)` from Phase 
1).
     - When the input is a fresh data file with a DV from the same snapshot 
(i.e., the new file is born with a DV), a single `ADDED` content_entry with 
populated `deletion_vector` is emitted — no REPLACED/MODIFIED pair.
     - `validateDeleteFileForVersion()` (already rejects `POSITION_DELETES` 
files at v4) becomes a routing decision rather than an error: callers may still 
hand position-delete `DeleteFile` objects to v4 RowDelta; the producer 
collapses them into colocated DVs instead of failing.
     - `RowDelta.validateAddedDataFiles` / `validateAddedDeleteFiles`: extend 
the v4 branch to detect write-write conflicts on DV state — if a concurrent 
snapshot also produced REPLACED/MODIFIED for the same data-file row, fail with 
`ValidationException`.
   - `core/.../ManifestFilterManager.java`: when computing the carried-over 
manifests, recognize that a leaf manifest with REPLACED/MODIFIED rows produced 
by this commit has been rewritten — emit only the new manifest reference at 
root level.
   - Update the SnapshotSummary counters to track `REPLACED` rows and DV 
cardinality changes.
   
   **Files**: ~4 modified, 0 new. **Tests**: new `TestV4ColocatedDV` covers (a) 
add DV to existing data file, (b) replace existing DV, (c) data file born with 
DV, (d) concurrent DV updates conflict. Existing `TestRowDelta` cases 
parameterized for v4 in Phase 10.
   
   **Risk**: high. This rewrites the heart of the row-delta path, and the 
REPLACED/MODIFIED pairing introduces new commit-conflict semantics. The single 
biggest reviewer attention area.
   
   ---
   
   ### Phase 7: Read-path DV resolution wiring
   
   `TrackedFileAdapters` (PR #16100) already projects colocated DVs back to 
legacy `DeleteFile(POSITION_DELETES)` shape on the read side, so most of the 
scan planning chain works without changes. This phase is the small set of glue 
updates needed to route those synthesized delete files correctly.
   
   - `core/.../DeleteFileIndex.java`: when `ContentEntryReader` yields 
synthesized DV `DeleteFile` objects from a data manifest (rather than a 
separate delete manifest), `DeleteFileIndex.builderFor(...)` must group them by 
`referencedDataFile()` exactly as it does for v3 DVs. Verify that the existing 
grouping logic handles DVs that arrive interleaved with their target data files 
in the same manifest read.
   - `core/.../ManifestGroup.java`: confirm that filtering on a v4 data 
manifest's content_entry rows correctly applies to the projected DataFile (not 
the synthesized DV DeleteFile). The simplest invariant: the partition/stats 
filter is evaluated on the DataFile projection only; the DV is attached after 
the data file passes filtering.
   - `core/.../ScanPlanningMode.java` (if applicable): no changes — single-pass 
planning is unchanged because data and delete content arrive together in the 
data manifest.
   - `core/.../BaseTableScan.java`: smoke check that `planFiles()` produces the 
same `FileScanTask` shape for v3 and v4 tables under the same workload.
   
   **Files**: ~2 modified (mostly verification, small adjustments). **Tests**: 
parameterize representative scan-planning tests at v4 in this phase (selected 
subset, full re-enablement in Phase 10).
   
   ---
   
   ### Phase 8: All `SnapshotProducer` subclasses through v4
   
   Most subclasses inherit `MergingSnapshotProducer` and come along for free 
once Phases 5–6 land. This phase is the targeted work to make the rest pass v4 
parameterization.
   
   - `MergeAppend`, `FastAppend`, `BaseOverwriteFiles`, `BaseRowDelta`, 
`BaseReplacePartitions`: work transparently because they call 
`newManifestWriter()` / `newDeleteManifestWriter()` (polymorphic on format 
version, fixed in Phase 2) and feed `apply(base, parentSnapshot)`'s 
`List<ManifestFile>` to `SnapshotProducer.apply()` (branches on v4 in Phase 5).
   - `core/.../BaseRewriteFiles.java`: rewrite-assertion logic compares 
before/after manifests against `EntryStatus.ADDED`/`EXISTING`. Extend to admit 
`REPLACED`/`MODIFIED` for v4 (where rewriting a data file with a DV update is a 
valid intermediate state).
   - `core/.../BaseRewriteManifests.java`: operates at the manifest level and 
produces new manifest references. Ensure the rewritten manifests carry correct 
`writer_format_version` and `manifest_info` when re-emitted into a root 
manifest. Likely a small targeted change in the result-construction code path.
   - `core/.../RemoveSnapshots.java`, `core/.../IncrementalFileCleanup.java`, 
`core/.../ReachableFileUtil.java`: read-only walks over the manifest tree must 
follow root manifests for v4 snapshots and manifest lists for v1–v3. 
`BaseSnapshot.cacheManifests()` (Phase 4) handles this transparently for most 
callers; check direct readers of `manifestListLocation()` and update them to 
use the v4-aware `dataManifests(io)` / `deleteManifests(io)` accessors instead.
   
   **Files**: ~5 modified. **Tests**: parameterize `TestRewriteFiles`, 
`TestRewriteManifests`, `TestRemoveSnapshots`, `TestReplacePartitions` at v4 
(selected representative cases in this phase, full coverage in Phase 10).
   
   ---
   
   ### Phase 9: Engine integration (Spark, Flink)
   
   Engine row-delta paths today produce `DeleteFile` objects with 
`FileContent.POSITION_DELETES` — Phase 6 now collapses those into colocated DVs 
at the catalog layer. So in principle no engine change is required for 
correctness. But two things still need attention:
   
   - **Spark**:
     - `spark/.../SparkPositionDeltaWriter.java`, 
`spark/.../SparkFileWriterFactory.java`: confirm the position-delete writer 
still emits a Puffin DV blob (existing v3 behavior) and that the resulting 
`DeleteFile` flows through `RowDelta.addRows()/addDeletes()` cleanly. The 
catalog layer collapses it; the engine doesn't need to know.
     - `spark/.../SparkWriteConf.java`, 
`spark/.../SparkWriteRequirements.java`: add a v4-aware branch that surfaces 
"use DVs" as the default for v4 (matches current v3 default when 
`write.delete.mode=merge-on-read` with DVs).
     - `spark/.../actions/RewriteDataFilesSparkAction.java`, 
`RewritePositionDeleteFilesSparkAction.java`: verify v4 semantics — 
rewrite-position-delete files becomes "compact DVs" (since v4 only has DVs, not 
position-delete files). The action may simplify or no-op on v4.
   - **Flink**:
     - `flink/.../sink/IcebergFilesCommitter.java`, 
`flink/.../sink/RowDataTaskWriterFactory.java`: same shape as Spark — ensure 
existing position-delete output flows through to the v4 collapse path in 
`MergingSnapshotProducer`.
     - `flink/.../source/IcebergSource.java`: read-side gets DVs via 
`TrackedFileAdapters` projection — no change needed.
   
   **Files**: ~3-5 modified Spark, ~2-3 modified Flink. **Risk**: medium. The 
engine plumbing looks correct on inspection, but the position-delete writer 
paths have many integration tests that will exercise the v4 collapse end-to-end 
in Phase 10.
   
   ---
   
   ### Phase 10: Re-enable v4 in broad parameterized tests
   
   Now this is a deliverable, not a triage exercise.
   
   - Bump `DEFAULT_MAX_TEST_FORMAT_VERSION` to 4 in 
`api/src/test/java/org/apache/iceberg/TestHelpers.java` (constant from #16677). 
This re-enables `TestBase`-derived suites at v4: `TestMergeAppend`, 
`TestOverwrite`, `TestRewriteFiles`, `TestReplacePartitions`, `TestRowDelta`, 
`TestRemoveSnapshots`, plus engine equivalents in `data/`, `spark/`, and 
`flink/`.
   - Triage and fix:
     - Tests that assert v4 manifests have a manifest-list path → update to 
assert `rootManifestLocation`.
     - Tests that read manifest list bytes directly → switch to 
`RootManifestReader`.
     - Tests asserting `DeleteFile(POSITION_DELETES)` shape on disk → update to 
assert colocated DV in the data manifest.
     - Spec-conformance tests for the on-disk format: roaring bitmap encoding 
for manifest DVs, exact field IDs, optional/required field handling, 
REPLACED+MODIFIED pair invariants.
     - Migration tests: v3→v4 upgrade with snapshots that have position-delete 
files. These must read correctly under the v4 read path (handled by the 
v4-aware `cacheManifests()` from Phase 4 plus the `writer_format_version=0` 
content references from Phase 5) even though they can never be re-emitted as v4 
writes.
   - Each batch of test fixes is a small follow-up PR within Phase 10.
   
   **Files**: 1 modified + dozens of test-only adjustments. **Risk**: medium — 
test churn dominates, but no new architectural risk.
   
   ---
   
   ## Out of scope (explicit follow-ups)
   
   1. **Direct data-file entries in the root manifest as a default behavior** 
(the small-write optimization). Phase 3's writer accepts them; no commit path 
emits them by default. A follow-up adds a heuristic in 
`SnapshotProducer.apply()` to skip leaf-manifest creation for sufficiently 
small writes.
   2. **Manifest delete vectors at the root level.** v4 spec allows marking a 
leaf manifest's row as deleted via a manifest DV stored in the root, avoiding 
leaf-manifest rewrite. All deletes in this plan still go through copy-on-write 
leaf rewrites. Adding root-level manifest DVs is a write-amplification 
optimization for high-churn tables.
   3. **Manifest compaction during commit.** v4 spec allows compacting existing 
manifests as part of any snapshot. Currently a separate `RewriteManifests` 
action.
   4. **Optional affinity between data and delete manifests.** v4 spec allows a 
delete manifest to be affiliated with a specific data manifest for single-pass 
planning. Not implemented in this plan.
   5. **Removing partition tuples / partition field summaries.** Still an open 
spec question — this plan materializes partition tuples as in v3. The 
`TrackedFileAdapters.partition()` TODO 
([#16222](https://github.com/apache/iceberg/issues/16222)) tracks the cleanup.
   6. **Union-of-all-historical-partition-specs schema for v4 partition 
types.** Per spec PR text, "the partition tuple schema for data files must be 
the union schema of all historical partition specs." Initial implementation 
uses the current spec's partition type only.
   
   ## Critical files to modify
   
   | Phase | File |
   |---|---|
   | 1 | `core/.../TrackedFile.java` — add `WRITER_FORMAT_VERSION` (id 157) |
   | 1 | `core/.../TrackedFileStruct.java` — add field + builder support |
   | 2 | `core/.../ManifestWriter.java` — V4Writer / V4DeleteWriter rewrite, 
REPLACED/MODIFIED counter fixes |
   | 2 | `core/.../ManifestFiles.java` — read dispatch by Parquet header |
   | 4 | `api/.../Snapshot.java` — `rootManifestLocation()` default method |
   | 4 | `core/.../BaseSnapshot.java` — formatVersion field, cacheManifests 
branch |
   | 4 | `core/.../SnapshotParser.java` — `root-manifest` JSON key |
   | 5 | `core/.../SnapshotProducer.java` — apply() v4 branch, 
rootManifestPath() |
   | 6 | `core/.../MergingSnapshotProducer.java` — colocated DV emission, 
REPLACED/MODIFIED pairing |
   | 6 | `core/.../ManifestFilterManager.java` — recognize rewritten leaves |
   | 7 | `core/.../DeleteFileIndex.java` — DV grouping from data manifests |
   | 8 | `core/.../BaseRewriteFiles.java` — admit REPLACED/MODIFIED |
   | 8 | `core/.../BaseRewriteManifests.java` — emit correct content_entry 
references |
   | 9 | `spark/.../SparkWriteConf.java`, `SparkWriteRequirements.java` — v4 
defaults |
   | 9 | `flink/.../sink/RowDataTaskWriterFactory.java` — v4 routing 
verification |
   | 10 | `api/src/test/java/.../TestHelpers.java` — 
`DEFAULT_MAX_TEST_FORMAT_VERSION = 4` |
   
   ## Critical files to add
   
   | Phase | File |
   |---|---|
   | 1 | `core/.../ContentEntryWrapper.java` |
   | 1 | `core/.../ContentEntryAdapter.java` |
   | 2 | `core/.../ContentEntryReader.java` |
   | 3 | `core/.../RootManifests.java` |
   | 3 | `core/.../RootManifestWriter.java` |
   | 3 | `core/.../RootManifestReader.java` |
   | 3 | `core/src/test/java/.../TestRootManifest.java` |
   | 5 | `core/src/test/java/.../TestV4SnapshotProducer.java` |
   | 6 | `core/src/test/java/.../TestV4ColocatedDV.java` |
   
   ## Reused existing utilities (do not duplicate)
   
   - `TrackedFile.schemaWithContentStats(...)` — `core/.../TrackedFile.java:95`
   - `TrackedFileAdapters.asDataFile / asDVDeleteFile / asEqualityDeleteFile` — 
PR #16100 (read-direction projection)
   - `TrackingBuilder` — `core/.../TrackingBuilder.java` (status derivation 
including REPLACED/MODIFIED from PR #16689)
   - `TrackedFileStruct`, `TrackingStruct`, `ManifestInfoStruct`, 
`DeletionVectorStruct` — concrete `StructLike` impls with builders
   - `StatsUtil.contentStatsFor(schema)` — schema → content_stats type
   - `MetricsUtil.fromMetrics(schema, metrics)` — `Metrics` → `ContentStats`
   - 
`MetricsUtil.valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds(ContentStats)`
 — PR #16100 (reverse projection helpers)
   - `InternalData.write(FileFormat.PARQUET, file)` — same path the existing 
V4Writer uses
   - `EntryStatus` — including `MODIFIED` from PR #16689
   - `GenericManifestFile` — for reconstructing `ManifestFile` objects from 
root manifest reads
   
   ## Top risks
   
   1. **Phase 6 REPLACED/MODIFIED pairing semantics.** Introducing a new commit 
pattern (one row in a leaf manifest produces two rows in the rewritten leaf — 
one REPLACED, one MODIFIED) is the largest design risk. Edge cases: concurrent 
DV updates on the same data file (must conflict-detect), DV update plus file 
deletion in the same snapshot (which status wins?), DV update on a file that 
was just added in the same commit (collapse to single ADDED with DV).
   2. **Read-path consistency for v3→v4 upgraded tables.** Phase 4 makes 
`BaseSnapshot.cacheManifests()` format-version-aware, but 
`SnapshotParser.fromJson` needs the table's `formatVersion` at parse time. 
Where snapshots are parsed without table context (e.g., in some catalog code 
paths), the `formatVersion` source must be threaded through. Missing this leads 
to runtime classloading-style failures on read.
   3. **Engine test flakiness on v4 (Phase 9-10).** Spark and Flink integration 
tests have many DV-related assertions hardcoded to v3 file shapes. Re-enabling 
v4 in their parameterized suites will surface a long tail of small fixes. 
Budget time accordingly.
   4. **`ManifestFiles.read()` format detection.** Dispatch is layered: 
snapshot JSON key chooses `manifest-list` vs `root-manifest` paths (Phase 4); 
the root manifest itself is always content_entry; leaf reads from a root use 
the entry's `writer_format_version` as a hint passed by the caller. No 
production caller inspects file metadata. The schema-shape fallback (presence 
of field 134 or 147 in the Parquet footer schema) is reserved for tests/ad-hoc 
tooling. Risk reduces to making sure all snapshot-rooted callers 
(`BaseSnapshot.cacheManifests`, `RootManifestReader`, `BaseRewriteManifests`, 
`RemoveSnapshots` walks) thread the hint through.
   5. **Migration of historical position-delete files.** v3→v4 upgraded tables 
may have pre-existing `DeleteFile(POSITION_DELETES)` entries in carried-over 
delete manifests. The v4 read path must handle these correctly through 
`TrackedFileAdapters` (reading the v3 entries as legacy `DeleteFile` objects, 
not as colocated DVs). v4 writers must never re-emit them as v4-shape 
position-delete entries. Test explicitly with a v3→v4 upgrade migration test.
   
   ## Verification
   
   Build & test (per phase):
   - `./gradlew :iceberg-core:spotlessApply :iceberg-core:compileJava 
:iceberg-core:compileTestJava`
   - Phase 2: `./gradlew :iceberg-core:test --tests TestManifestWriterVersions 
--tests TestContentEntryReader`
   - Phase 3: `./gradlew :iceberg-core:test --tests TestRootManifest`
   - Phase 4: `./gradlew :iceberg-core:test --tests TestSnapshotJson --tests 
TestBaseSnapshotV4`
   - Phase 5: `./gradlew :iceberg-core:test --tests TestV4SnapshotProducer`
   - Phase 6: `./gradlew :iceberg-core:test --tests TestV4ColocatedDV --tests 
TestRowDelta`
   - Phase 7: `./gradlew :iceberg-core:test --tests TestDeleteFileIndex --tests 
TestManifestGroup`
   - Phase 8: `./gradlew :iceberg-core:test --tests TestRewriteFiles --tests 
TestRewriteManifests --tests TestRemoveSnapshots --tests TestReplacePartitions`
   - Phase 9: `./gradlew :iceberg-spark-3.5_2.12:test :iceberg-flink-1.20:test` 
(representative subset)
   - Phase 10: full `:iceberg-core:test :iceberg-api:test :iceberg-data:test 
:iceberg-spark-3.5_2.12:test :iceberg-flink-1.20:test`
   
   End-to-end sanity for Phase 5 (write path basics):
   1. Create an unpartitioned v4 table.
   2. AppendFiles → snapshot has `rootManifestLocation` set, 
`manifestListLocation` null; root manifest is `.parquet` with 
`format-version=4` header; one `DATA_MANIFEST` content_entry (with 
`writer_format_version=1`) references one new leaf data manifest, which itself 
carries `format-version=4` and content_entry rows with 
`writer_format_version=1`.
   3. AppendFiles again → root has two `DATA_MANIFEST` entries, first marked 
`EXISTING`, second `ADDED`.
   4. DeleteFiles removing one file from the first leaf → root has one 
rewritten leaf reference (`ADDED`) replacing the prior, plus the second leaf 
still `EXISTING`; the rewritten leaf has the removed file's entry marked 
`DELETED`.
   
   End-to-end sanity for Phase 6 (DV semantics):
   5. Append data file D1 → leaf manifest M1 with one `ADDED` entry, 
`deletion_vector=null`.
   6. RowDelta adding a DV for D1 → new leaf manifest M2 with two entries: D1 
`REPLACED` (no DV) and D1 `MODIFIED` (with DV pointing into a Puffin file). 
Root manifest references M2 as `ADDED`, M1 omitted.
   7. Concurrent RowDelta also adding a DV for D1 → second commit fails with 
`ValidationException`.
   8. Append a brand-new data file D2 with a DV born in the same commit → leaf 
has one `ADDED` entry with `deletion_vector` populated.
   9. Re-read all snapshots end-to-end via `Table.newScan().planFiles()`; 
verify file lists and DV applications match expected.
   
   End-to-end sanity for Phase 10 (full parity):
   10. v3→v4 upgrade: create v3 table with appends + position-delete files, 
upgrade to v4, perform RowDelta + RewriteFiles + RemoveSnapshots; verify all 
operations succeed and produce spec-conformant v4 metadata while still reading 
historical v3 entries correctly.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] v4: add manifest write support [iceberg]

Reply via email to