blackmwk commented on code in PR #2509:
URL: https://github.com/apache/iceberg-rust/pull/2509#discussion_r3323713727


##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();

Review Comment:
   I think this is over complicated, we should throw error immediately when 
`new_files.insert` return false.



##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();

Review Comment:
   Also I prefer to move this check in the begigning of `FastAppendAction`'s 
`commit` method, rather than putting it here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to