suremarc commented on issue #6672:
URL: 
https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1608027489

   I have run into a somewhat overlapping (no pun intended) issue: DataFusion
abandons the `SortPreservingMergeStream` and does a global sort if any file
group contains more than one file. It should be possible for DataFusion to
realize that, if the files are non-overlapping, the files within each group
can be re-ordered to satisfy the required output ordering. In effect we would
be partitioning a poset of files into a series of chains, where A < B if the
two files are non-overlapping and every row in A sorts before every row in B.
Each chain then becomes one file group in the physical plan and is read
sequentially. Using file statistics and partition columns, it should be
possible to perform this analysis without reading any rows.
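
To make the chain idea concrete, here is a rough sketch of the partitioning
step, not DataFusion code. It assumes each file exposes a min and max for a
single integer sort key (as might be derived from statistics); `FileRange` and
`partition_into_chains` are hypothetical names for illustration only. Files are
sorted by their minimum and greedily appended to the first chain whose last
file ends strictly before the new file begins.

```rust
/// Hypothetical per-file summary: min and max of the sort key,
/// assumed to come from file statistics rather than from reading rows.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Greedily partition files into chains. Within a chain, every file's max
/// is strictly less than the next file's min, so reading the chain in order
/// yields rows already sorted on the key.
fn partition_into_chains(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Visit files in order of their smallest key value.
    files.sort_by_key(|f| (f.min, f.max));

    let mut chains: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // First chain whose last file ends before this file starts.
        let slot = chains
            .iter()
            .position(|chain| chain.last().map_or(false, |last| last.max < file.min));
        match slot {
            Some(i) => chains[i].push(file),
            // Overlaps the tail of every existing chain: start a new one.
            None => chains.push(vec![file]),
        }
    }
    chains
}
```

Each returned chain would correspond to one file group, and the groups could
then be combined with a sort-preserving merge instead of a global sort. The
strict `<` comparison is a conservative choice; whether equal boundary values
can share a chain depends on how ties in the ordering should be treated.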


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.