krisnaru opened a new pull request, #16220:
URL: https://github.com/apache/iceberg/pull/16220
## Summary
- Removes broken snapshotIds.contains(entry.snapshotId()) filter from
RewriteTablePathUtil that silently dropped live, un replicated files when their
adding snapshot was expired
- Replaces snapshot-ID-based entry filtering with entry.isLive() only — all
live content files are included in the copy plan
- Adds anti-join dedup for incremental copies: filters out
previously-replicated files by comparing against the start version's content
files via contentFileDS
- Fixes both full copy (expired snapshot IDs missing from
endMetadata.snapshots()) and incremental copy (expired snapshot IDs missing
from deltaSnapshotIds after RewriteManifests + ExpireSnapshots)
## Problem
RewriteTablePathUtil.writeDataFileEntry filtered copy plan entries using
snapshotIds.contains(entry.snapshotId()). This is fundamentally broken because
entry.snapshotId() reflects the snapshot that originally added the file, which
can be expired at any time by table maintenance. When RewriteManifests
reorganizes entries from expired snapshots into new manifests, the manifest is
correctly selected for processing but the entry-level filter drops the files —
causing silent data loss at the target.
**Scenario**:
Append (S3) → RewriteManifests (S4) → Expire S3 → Incremental replication
builds deltaSnapshotIds = {S4}, but file C has entry.snapshotId() = S3 →
{S4}.contains(S3) == false → file C excluded from copy plan.
## Approach
1. Entry level: Include all live entries in copy plan (no snapshot ID check)
2. Incremental dedup: Anti-join against start version's content files using
contentFileDS from BaseSparkAction
3. Manifest selection: Unchanged — manifestsToRewrite still uses
deltaSnapshotIds from version history
## Test plan
- Verify existing TestRewriteTablePathsAction tests pass (full copy,
incremental, expire scenarios)
- Verify existing TestRewriteTablePathUtil tests pass
- Add integration test: incremental replication after append +
RewriteManifests + ExpireSnapshots
- Verify incremental copy produces correct file counts (no duplicate copies
of already-replicated files)
- Verify full copy after many snapshots with expired snapshot IDs includes
all live data
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]