suremarc commented on code in PR #9593:
URL: https://github.com/apache/arrow-datafusion/pull/9593#discussion_r1534573850


##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -194,6 +203,71 @@ impl FileScanConfig {
             .with_repartition_file_min_size(repartition_file_min_size)
             .repartition_file_groups(&file_groups)
     }
+
+    /// Attempts to do a bin-packing on files into file groups, such that any 
two files
+    /// in a file group are ordered and non-overlapping with respect to their 
statistics.
+    /// It will produce the smallest number of file groups possible.

Review Comment:
   If I understand correctly what you're asking, I think it will already do 
this -- see for example the sqllogictest I added: 
https://github.com/apache/arrow-datafusion/blob/e9fad54257a13ab9630a0f038fa2b5a1b26aa006/datafusion/sqllogictest/test_files/parquet.slt#L193-L203
   
   As you can see, the physical plan has no sort. The `FileScanConfig` 
correctly detects that the files are ordered based on their statistics, and 
this information is propagated to the physical planner. So fortunately this 
works already. 
   
   As for next steps, it actually just occurred to me that if there are 1000 
files that all overlap, after this PR `ListingTable` will produce 1000 file 
groups which is really bad. So I think I should add a check that only uses the 
sorted file groups if they produce less than `target_partitions` file groups. I 
will add that in this PR. 
   
   The other thing we should maybe do is modify `sort_file_groups` to 
_increase_ the number of file groups if it's lower than `target_partitions`, 
but I also thought maybe we should let DataFusion's `EnforceDistribution` 
physical optimizer take care of this, so it might not be necessary. 
   
   Other than that I'm not sure immediately sure what next steps are, without 
gathering more data from real-world usage. Though, I would like to add some 
benchmarks to DataFusion for large plans (with thousands of files), as I think 
the performance of this code might become more apparent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to