viirya commented on PR #2509: URL: https://github.com/apache/iceberg-rust/pull/2509#issuecomment-4586094643
Nice catch on the `HashSet`-eats-the-duplicates bug — the fail-fast rewrite reads cleanly. Two things before this lands, both building on @blackmwk's open comments: **The batch-internal check probably shouldn't be gated on `check_duplicate`.** I agree with @blackmwk's point on `append.rs:88`. The `check_duplicate` flag exists to let callers skip the *cross-snapshot* validation, which costs manifest I/O. The intra-batch check is pure in-memory work with no such cost, and skipping it means `commit()` can silently write a manifest that references the same `file_path` twice — i.e. exactly the corruption this PR is meant to prevent. I'd run `check_no_duplicate_paths_in_batch` unconditionally. Worth flagging that `test_fast_append_intra_batch_duplicate_check_can_be_disabled` currently *locks in* that behavior — it asserts two identical paths commit successfully when the flag is off. If the check becomes unconditional, that test's assertion should flip to expecting an error. **Scope is `FastAppendAction` only.** This fixes the fast-append path; it'd be worth a quick follow-up to confirm `MergeAppend` / overwrite don't share the same latent issue (they may route through different validation). Not blocking this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
