suremarc commented on code in PR #9593:
URL: https://github.com/apache/arrow-datafusion/pull/9593#discussion_r1539550039


##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -194,6 +203,71 @@ impl FileScanConfig {
             .with_repartition_file_min_size(repartition_file_min_size)
             .repartition_file_groups(&file_groups)
     }
+
+    /// Attempts to do a bin-packing on files into file groups, such that any two files
+    /// in a file group are ordered and non-overlapping with respect to their statistics.
+    /// It will produce the smallest number of file groups possible.

Review Comment:
   @alamb I didn't have any data in mind, so any head start we can get sounds great.
   
   Regarding the `SortPreservingMerge`, I want to state again that `sort_file_groups` will produce (if possible) only one file group with all files sequentially ordered, so I think the plan will not have a sort-preserving merge in the first place. I think maybe I just need to make this clearer in a separate sqllogictest. But crucially, this means that #6672 would already be fixed by this PR.
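
   To illustrate the single-group case, here is a minimal sketch of the idea, not the code in this PR. `FileRange` and `pack_non_overlapping` are hypothetical names, and a single `i64` min/max pair stands in for the real per-file column statistics. Files are sorted by their minimum and each one is placed in the first group whose last file ends strictly before it starts:

```rust
/// Hypothetical stand-in for one file's min/max statistics on the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Greedy first-fit packing (an illustrative assumption, not the PR's code).
/// Within every returned group, files are ordered by `min` and each file
/// starts strictly after the previous file's `max`.
fn pack_non_overlapping(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Visit files in order of their lower bound.
    files.sort_by_key(|f| f.min);

    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        let start = file.min;
        // Find the first group whose last file ends before this one starts.
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max < start));
        match slot {
            Some(i) => groups[i].push(file),
            // Every existing group overlaps this file, so open a new one.
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

   With ranges `[0, 9]`, `[10, 19]`, `[20, 29]` this returns one group containing all three files in order, which is the case where no merge is needed. If the first range is widened to `[0, 15]`, it overlaps the second and the sketch falls back to two groups.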



