SreeramGarlapati opened a new pull request, #2509: URL: https://github.com/apache/iceberg-rust/pull/2509
## Which issue does this PR close? - Closes #2507. ## What changes are included in this PR? `validate_duplicate_files` was meant to refuse any append that would point at a file the table already references. It's been doing exactly half of that. The other half — refusing the same file appearing twice inside one batch — has been broken since the function was written, because the very first thing it does is collect the batch into a `HashSet`. The duplicates are gone before the check looks for them. Concretely: hand `FastAppendAction` two `DataFile`s with identical `file_path`s, call `commit()` with the default `check_duplicate=true`, and you get `Ok`. The resulting manifest holds two entries pointing at the same Parquet file, the snapshot summary's `added_files_count` is wrong, and every scan double-counts the rows. Nothing in the commit path notices. The fix is small: walk the added files instead of dumping them into a set, track every distinct path that collides, and return `DataInvalid` with the sorted list of offenders if any did. The existing cross-snapshot check reuses the same set and is otherwise unchanged, and `with_check_duplicate(false)` still disables both halves of the check together — the opt-out stays symmetric. This is a sibling of #1394, which fixed the cross-snapshot half of the same function. Same shape of bug, opposite side of the comparison. ## Are these changes tested? Yes — two unit tests in `crates/iceberg/src/transaction/append.rs`. The first throws three identical paths into a single batch and asserts the commit fails with `DataInvalid`, with the offending path appearing exactly once in the error message (so a future refactor can't quietly regress to repeating `a, a, a`). The second asserts that `with_check_duplicate(false)` still accepts batch duplicates — same opt-out behaviour the cross-snapshot check already documents. I also verified the repro by reverting just the `snapshot.rs` change and rerunning the first test against the unfixed code — it fails exactly where the panic-on-`Ok` branch lives, which is the bug demonstrated as a failing unit test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
