xanderbailey commented on code in PR #2509:
URL: https://github.com/apache/iceberg-rust/pull/2509#discussion_r3320658707


##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.

Review Comment:
   This comment is maybe a little verbose? 



##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -164,11 +164,28 @@ impl<'a> SnapshotProducer<'a> {
     }
 
     pub(crate) async fn validate_duplicate_files(&self) -> Result<()> {
-        let new_files: HashSet<&str> = self
-            .added_data_files
-            .iter()
-            .map(|df| df.file_path.as_str())
-            .collect();
+        // First, detect duplicate file paths within the batch itself.
+        // Collecting straight into a `HashSet` would silently deduplicate
+        // and pass corruption through to the manifest, so we walk the list
+        // explicitly and surface every distinct duplicated path.
+        let mut new_files: HashSet<&str> = 
HashSet::with_capacity(self.added_data_files.len());
+        let mut duplicates_in_batch: HashSet<&str> = HashSet::new();
+        for data_file in &self.added_data_files {
+            if !new_files.insert(data_file.file_path.as_str()) {
+                duplicates_in_batch.insert(data_file.file_path.as_str());
+            }
+        }
+        if !duplicates_in_batch.is_empty() {
+            let mut paths: Vec<&str> = 
duplicates_in_batch.into_iter().collect();
+            paths.sort_unstable();

Review Comment:
   Curious to know if we need the sort?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to