SreeramGarlapati commented on code in PR #2509:
URL: https://github.com/apache/iceberg-rust/pull/2509#discussion_r3327850150


##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.

Review Comment:
   ya - shouldn't have been there. dropped it entirely; the function name + err 
message convey it.



##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();
+        for data_file in &self.added_data_files {
+            if !new_files.insert(data_file.file_path.as_str()) {
+                duplicates_in_batch.insert(data_file.file_path.as_str());
+            }
+        }
+        if !duplicates_in_batch.is_empty() {
+            let mut paths: Vec<&str> = 
duplicates_in_batch.into_iter().collect();
+            paths.sort_unstable();

Review Comment:
   moot after blackmwk's fail-fast suggestion (below) - we now report just the 
first dup, no list to sort.



##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();

Review Comment:
   agreed - simplified to fail-fast on first `insert` collision. no extra set, 
no sort.



##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();

Review Comment:
   done - extracted as a free fn `check_no_duplicate_paths_in_batch` in 
`transaction::snapshot`, called at the top of `FastAppendAction::commit` under 
the same `check_duplicate` gate. `SnapshotProducer::validate_duplicate_files` 
is now purely the cross-snapshot check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to