SreeramGarlapati opened a new issue, #2507:
URL: https://github.com/apache/iceberg-rust/issues/2507
### Apache Iceberg Rust version
`main` (observed against the v0.9 line; same code present on `main` at the
time of filing).
### Describe the bug
`FastAppendAction::commit` with `with_check_duplicate(true)` (the default)
does not detect duplicate `file_path` values **within the batch being
appended**. Two `DataFile` entries carrying the same `file_path` in a single
`add_data_files(...)` call are accepted, written into the manifest, and
committed without error.
The root cause is in `SnapshotProducer::validate_duplicate_files`
(`crates/iceberg/src/transaction/snapshot.rs`):
```rust
let new_files: HashSet<&str> = self
.added_data_files
.iter()
.map(|df| df.file_path.as_str())
.collect();
```
Collecting into a `HashSet` silently dedupes the batch before the check
runs. The subsequent loop then only compares the (already-deduped) set against
entries in existing manifests — it never observes that the batch itself
contained the same path twice. The pre-existing fix for #1394 only addressed
the cross-snapshot side of the check; the intra-batch side has the same shape
of bug.
### Impact
Once the duplicate-bearing snapshot is committed:
- A scan reads the same physical file twice, returning duplicate rows.
- `added_files_count` / `added_records` in the snapshot summary are inflated
relative to physical storage.
- Nothing in the commit path surfaces the violation — corruption is silent.
This is reachable from any upstream file-discovery or retry path that can
hand the same path to `add_data_files` more than once.
### To Reproduce
```rust
let table = make_v2_minimal_table();
let tx = Transaction::new(&table);
let mk = |size, records| DataFileBuilder::default()
.content(DataContentType::Data)
.file_path(\"test/dup.parquet\".to_string()) // identical path
.file_format(DataFileFormat::Parquet)
.file_size_in_bytes(size)
.record_count(records)
.partition_spec_id(table.metadata().default_partition_spec_id())
.partition(Struct::from_iter([Some(Literal::long(1))]))
.build()
.unwrap();
let action = tx
.fast_append()
.with_check_duplicate(true)
.add_data_files(vec![mk(100, 10), mk(200, 20)]);
// Today: returns Ok and writes a manifest with two entries pointing at the
same file.
// Expected: returns Err(DataInvalid).
Arc::new(action).commit(&table).await.unwrap();
```
### Expected behavior
`validate_duplicate_files` should reject any commit whose batch contains the
same `file_path` more than once, with `ErrorKind::DataInvalid` and a message
listing the offending paths.
### Willingness to contribute
Yes — fix and tests prepared, PR incoming.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]