SreeramGarlapati opened a new pull request, #2509:
URL: https://github.com/apache/iceberg-rust/pull/2509

   ## Which issue does this PR close?
   
   - Closes #2507.
   
   ## What changes are included in this PR?
   
   `validate_duplicate_files` was meant to refuse any append that would point 
at a file the table already references. It's been doing exactly half of that. 
The other half — refusing the same file appearing twice inside one batch — has 
been broken since the function was written, because the very first thing it 
does is collect the batch into a `HashSet`. The duplicates are gone before the 
check looks for them.
   
   Concretely: hand `FastAppendAction` two `DataFile`s with identical 
`file_path`s, call `commit()` with the default `check_duplicate=true`, and you 
get `Ok`. The resulting manifest holds two entries pointing at the same Parquet 
file, the snapshot summary's `added_files_count` is wrong, and every scan 
double-counts the rows. Nothing in the commit path notices.
   
   The fix is small: walk the added files instead of dumping them into a set, 
track every distinct path that collides, and return `DataInvalid` with the 
sorted list of offenders if any did. The existing cross-snapshot check reuses 
the same set and is otherwise unchanged, and `with_check_duplicate(false)` 
still disables both halves of the check together — the opt-out stays symmetric.
   
   This is a sibling of #1394, which fixed the cross-snapshot half of the same 
function. Same shape of bug, opposite side of the comparison.
   
   ## Are these changes tested?
   
   Yes — two unit tests in `crates/iceberg/src/transaction/append.rs`.
   
   The first throws three identical paths into a single batch and asserts the 
commit fails with `DataInvalid`, with the offending path appearing exactly once 
in the error message (so a future refactor can't quietly regress to repeating 
`a, a, a`).
   
   The second asserts that `with_check_duplicate(false)` still accepts batch 
duplicates — same opt-out behaviour the cross-snapshot check already documents.
   
   I also verified the repro by reverting just the `snapshot.rs` change and 
rerunning the first test against the unfixed code — it fails exactly where the 
panic-on-`Ok` branch lives, which is the bug demonstrated as a failing unit 
test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to