tongwai-wong-appier commented on issue #13763:
URL: https://github.com/apache/iceberg/issues/13763#issuecomment-4438421059

   @kumarpritam863 Thanks for the explanation. That helps clarify the current 
offset-validation path, especially for what I would call the "stale replay" 
case.
   
   To make sure I understand the guarantees correctly, would it be reasonable 
to separate the situations into these three cases?
   
   1. **Case 1: the exact same `DATA_WRITTEN(file X)` control-topic record is 
replayed**
      - same control-topic offset
      - replayed after recovery / coordinator switchover
      - from reading current `main`, this appears to be covered by:
        - filtering against committed control-topic offsets before 
`commitToTable()`
        - plus `SnapshotAncestryValidator`
   
   2. **Case 2: two distinct `DATA_WRITTEN(file X)` records are buffered, but 
both land in the same commit**
      - from reading current `main`, this appears to be covered by
        `distinctByKey(ContentFile::location)` inside `commitToTable()`
   
   3. **Case 3: two distinct `DATA_WRITTEN(file X)` records are produced across 
two different commit cycles / snapshots**
      - for example:
        - commit A already appends file `X` into snapshot `S1`
        - then a later `StartCommit` causes a new `DATA_WRITTEN(file X)` event 
to be emitted again
        - that later event has a new control-topic offset and may belong to a 
new commitId
        - then commit B attempts to append the same physical file `X` into 
snapshot `S2`
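
   To make the distinction concrete, here is a toy model of the three cases. It is only a sketch under my own assumptions: `DataWrittenCases`, `Event`, and `filesToAppend` are hypothetical names, and the two guards are simplified stand-ins for the offset filter and the in-commit `distinctByKey` step, not the connector's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not the connector's real classes.
final class DataWrittenCases {

  /** One DATA_WRITTEN control-topic record, reduced to the fields that matter here. */
  static final class Event {
    final long offset;      // control-topic offset of the record
    final String location;  // physical data file path

    Event(long offset, String location) {
      this.offset = offset;
      this.location = location;
    }
  }

  /**
   * Applies the two guards from Case 1 and Case 2 and returns the file
   * locations that would still be appended in this commit.
   * lastCommittedOffset is assumed to be the highest offset already committed.
   */
  static List<String> filesToAppend(List<Event> buffered, long lastCommittedOffset) {
    Set<String> seenInThisCommit = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (Event e : buffered) {
      if (e.offset <= lastCommittedOffset) {
        continue; // Case 1: stale replay of an already committed record
      }
      if (seenInThisCommit.add(e.location)) {
        result.add(e.location); // Case 2: keep only the first occurrence per commit
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Case 1: the same record (offset 5) is replayed after offset 5 was committed.
    System.out.println(filesToAppend(Arrays.asList(new Event(5L, "X")), 5L));

    // Case 2: two distinct records for X are buffered into the same commit.
    System.out.println(
        filesToAppend(Arrays.asList(new Event(6L, "X"), new Event(7L, "X")), 5L));

    // Case 3: commit A appends X, then a new record for X arrives at a newer offset.
    System.out.println(filesToAppend(Arrays.asList(new Event(6L, "X")), 5L)); // commit A
    System.out.println(filesToAppend(Arrays.asList(new Event(9L, "X")), 6L)); // commit B
  }
}
```

   In this model, both guards pass file `X` in commit B, because neither of them looks at what earlier snapshots already contain.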
   
   We currently suspect that our incident is closer to **Case 3** than to 
Case 1.
   
   So the question I want to confirm is:
   
   > Does current `main` also guarantee deduplication / rejection for **Case 
3**?
   >
   > In other words, if the second `DATA_WRITTEN(file X)` is not a replay of 
the old control-topic record, but a newly produced control-topic event with a 
newer offset, which part of the current logic prevents `file X` from being 
appended again into a later snapshot?
   
   The reason I am asking is that the current protections seem clearly tied to:
   - stale control-topic offset replay, and
   - concurrent commit validation
   
   but I am not yet seeing an obvious cross-snapshot file-level idempotency 
check for the "same physical file, new control-topic event" case.
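
   To illustrate the kind of check I mean, here is a minimal sketch. It is purely hypothetical: `CrossSnapshotDedup` and `dropAlreadyCommitted` are names I made up, the file paths are placeholders, and how the set of already committed file locations would be obtained from table metadata (and at what cost) is deliberately left out. I am not claiming that anything like this exists in `main`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a file-level idempotency check across snapshots.
final class CrossSnapshotDedup {

  /**
   * Returns the candidate locations that are not yet part of the table,
   * preserving order and dropping duplicates within the candidate list.
   * committedLocations is assumed to hold every data file location that
   * earlier snapshots already reference.
   */
  static List<String> dropAlreadyCommitted(
      List<String> candidates, Set<String> committedLocations) {
    Set<String> seen = new HashSet<>(committedLocations); // copy, do not mutate the input
    List<String> fresh = new ArrayList<>();
    for (String location : candidates) {
      if (seen.add(location)) {
        fresh.add(location); // neither committed earlier nor seen in this batch
      }
    }
    return fresh;
  }

  public static void main(String[] args) {
    // Snapshot S1 already contains X (placeholder paths).
    Set<String> committed = new HashSet<>(Arrays.asList("s3://bucket/data/X.parquet"));

    // Commit B carries a new DATA_WRITTEN event for X plus a genuinely new file Y.
    List<String> commitB =
        Arrays.asList("s3://bucket/data/X.parquet", "s3://bucket/data/Y.parquet");

    // Only Y survives; X is rejected regardless of its control-topic offset.
    System.out.println(dropAlreadyCommitted(commitB, committed));
  }
}
```

   The point of the sketch is that the decision is keyed on the file location alone, so it does not depend on the control-topic offset or the commitId of the event that carried the file.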
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

