krisnaru opened a new pull request, #16220:
URL: https://github.com/apache/iceberg/pull/16220

   ## Summary
   
   - Removes broken snapshotIds.contains(entry.snapshotId()) filter from 
RewriteTablePathUtil that silently dropped live, un replicated files when their 
adding snapshot was expired
   - Replaces snapshot-ID-based entry filtering with entry.isLive() only — all 
live content files are included in the copy plan
   - Adds anti-join dedup for incremental copies: filters out 
previously-replicated files by comparing against the start version's content 
files via contentFileDS
   - Fixes both full copy (expired snapshot IDs missing from 
endMetadata.snapshots()) and incremental copy (expired snapshot IDs missing 
from deltaSnapshotIds after RewriteManifests + ExpireSnapshots)
   
   ## Problem
   
   RewriteTablePathUtil.writeDataFileEntry filtered copy plan entries using 
snapshotIds.contains(entry.snapshotId()). This is fundamentally broken because 
entry.snapshotId() reflects the snapshot that originally added the file, which 
can be expired at any time by table maintenance. When RewriteManifests 
reorganizes entries from expired snapshots into new manifests, the manifest is 
correctly selected for processing but the entry-level filter drops the files — 
causing silent data loss at the target.
   
   **Scenario**: 
   Append (S3) → RewriteManifests (S4) → Expire S3 → Incremental replication 
builds deltaSnapshotIds = {S4}, but file C has entry.snapshotId() = S3 → 
{S4}.contains(S3) == false → file C excluded from copy plan.
   
   ## Approach
   1. Entry level: Include all live entries in copy plan (no snapshot ID check)
   2. Incremental dedup: Anti-join against start version's content files using 
contentFileDS from BaseSparkAction
   3. Manifest selection: Unchanged — manifestsToRewrite still uses 
deltaSnapshotIds from version history
   
   ## Test plan
   
   - Verify existing TestRewriteTablePathsAction tests pass (full copy, 
incremental, expire scenarios)
   - Verify existing TestRewriteTablePathUtil tests pass
   - Add integration test: incremental replication after append + 
RewriteManifests + ExpireSnapshots
   - Verify incremental copy produces correct file counts (no duplicate copies 
of already-replicated files)
   - Verify full copy after many snapshots with expired snapshot IDs includes 
all live data


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to