suremarc commented on code in PR #9593:
URL: https://github.com/apache/arrow-datafusion/pull/9593#discussion_r1534573850
##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -194,6 +203,71 @@ impl FileScanConfig {
.with_repartition_file_min_size(repartition_file_min_size)
.repartition_file_groups(&file_groups)
}
+
+ /// Attempts to do a bin-packing on files into file groups, such that any
two files
+ /// in a file group are ordered and non-overlapping with respect to their
statistics.
+ /// It will produce the smallest number of file groups possible.
Review Comment:
If I understand correctly what you're asking, I think it will already do
this -- see for example the sqllogictest I added:
https://github.com/apache/arrow-datafusion/blob/e9fad54257a13ab9630a0f038fa2b5a1b26aa006/datafusion/sqllogictest/test_files/parquet.slt#L193-L203
As you can see, the physical plan has no sort. The `FileScanConfig`
correctly detects that the files are ordered based on their statistics, and
this information is propagated to the physical planner. So fortunately this
works already.
As for next steps, it actually just occurred to me that if there are 1000
files that all overlap, after this PR `ListingTable` will produce 1000 file
groups which is really bad. So I think I should add a check that only uses the
sorted file groups if they produce less than `target_partitions` file groups. I
will add that in this PR.
The other thing we should maybe do is modify `sort_file_groups` to
_increase_ the number of file groups if it's lower than `target_partitions`,
but I also thought maybe we should let DataFusion's `EnforceDistribution`
physical optimizer take care of this, so it might not be necessary.
Other than that I'm not sure immediately sure what next steps are, without
gathering more data from real-world usage. Though, I would like to add some
benchmarks to DataFusion for large plans (with thousands of files), as I think
the performance of this code might become more apparent
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]