j1wonpark opened a new pull request, #4231:
URL: https://github.com/apache/amoro/pull/4231

   ## Why are the changes needed?
   
   When AMS expires snapshots on a single-ref Iceberg table, Iceberg 
auto-selects
   `IncrementalFileCleanup`. That strategy walks the current snapshot's ancestor
   chain via `SnapshotUtil.ancestorIds`, and **if a parent snapshot is missing
   from table metadata the walk terminates silently**. Snapshots below the break
   are then treated as non-ancestors, so the `ADDED` entries of the manifests 
they
   wrote get reverted during cleanup. When such a manifest was superseded by a
   merge but the data file it added is still carried over (as `EXISTING`) in a
   manifest the current snapshot references, the cleanup deletes a data file 
that
   the live table still points at. Reads of the affected partitions then fail 
with
   `NotFoundException`.
   
   We hit this in production: a tag pinned a snapshot off the main line, normal
   expiration removed the snapshots that linked it back to the head (safe at 
that
   point because `refs >= 2` forced `ReachableFileCleanup`), and once the tag 
was
   dropped the table fell back to a single ref. The next expiration ran
   `IncrementalFileCleanup`, its ancestor walk truncated at the gap, and data 
files
   still referenced by the current snapshot were physically deleted.
   
   ### Mechanism
   
   ```mermaid
   flowchart TB
       HEAD["current snapshot (head)"]
       S_b["S_b"]
       S_a["S_a (off main line)"]
       M_merged["M_merged : F = EXISTING"]
       M_a["M_a : F = ADDED"]
       F["data file F"]
       CUT["walk truncates - S_a unreachable"]
       DEL["F deleted"]
       LOSS["NotFoundException"]
   
       HEAD --> S_b
       S_b -.->|"parent missing, walk stops"| CUT
       HEAD -->|"manifest list"| M_merged
       S_a -->|"manifest list"| M_a
       M_merged --> F
       M_a --> F
       S_a ==>|"treated as expirable, M_a reverted"| DEL
       F -.-> DEL
       DEL -.->|"M_merged still references F"| LOSS
   ```
   
   The same data file `F` is referenced through two manifests — `M_a` (`ADDED`,
   written by the off-line snapshot) and `M_merged` (`EXISTING`, carried over 
and
   still referenced by the head). When the walk drops `S_a`, `M_a` becomes a 
revert
   target, and cleanup deletes `F` without checking that `M_merged` keeps it 
alive.
   
   ### Upstream status
   
   This is the same root cause as Iceberg issue
   [apache/iceberg#13568](https://github.com/apache/iceberg/issues/13568),
   fixed by [apache/iceberg#13614](https://github.com/apache/iceberg/pull/13614)
   (milestone **1.10.0**). That fix hardens the strategy selector in
   `RemoveSnapshots` — `validateCleanupCanBeIncremental()` only allows 
incremental
   cleanup when no snapshots are explicitly removed, all removed snapshots are
   ancestors of main (`hasRemovedNonMainAncestors`), and no non-main snapshots
   exist before or after expiration (`hasNonMainSnapshots`) — otherwise it falls
   back to `ReachableFileCleanup`.
   
   Iceberg `1.6.x ~ 1.9.x` lack that guard: for a single ref they 
unconditionally
   pick the incremental strategy and are exposed. Bumping Iceberg core is not
   always an option for existing deployments, so this PR backports an equivalent
   guard at the Amoro layer.
   
   ## Brief change log
   
   - `IcebergTableMaintainer`: before committing `expireSnapshots`, check
     `hasSnapshotsOutsideMainAncestry()` — which walks the same
     `SnapshotUtil.ancestorIds` chain the cleanup uses — and force
     `ReachableFileCleanup` only when the table is in the at-risk state. Healthy
     tables keep Iceberg's automatic (incremental) selection, so there is no 
cost
     impact.
   - `ReachableFileCleanupBridge` (new, in package `org.apache.iceberg`):
     `RemoveSnapshots#withIncrementalCleanup(boolean)` is package-private, so 
the
     bridge calls it without reflection. Compile-time binding means a future
     signature change breaks the build instead of silently failing.
   
   `ReachableFileCleanup` does not rely on the ancestor walk; it deletes only 
the
   complement of the file set referenced by live snapshots, so it is immune to 
the
   walk truncation regardless of ref count.
   
   ## How was this patch tested?
   
   - [x] Added a regression test (`TestExpireSnapshotsKeepReferencedFiles`) that
     reproduces the incident: it builds the carry-over + broken-chain state and
     asserts that data files still referenced by the current snapshot survive
     expiration.
   - [ ] Add screenshots for manual tests if appropriate
   - [x] Run test locally before making a pull request
   
   ## Documentation
   
   - Does this pull request introduce a new feature? (no)
   - If yes, how is the feature documented? (not applicable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to