alamb commented on issue #6672:
URL: https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1608113109

   > I have had a somewhat overlapping (no pun intended) issue where DataFusion 
abandons the SortPreservingMergeStream and does a global sort if there are 
multiple files in any file groups. It should be possible for DataFusion to 
realize that, if the files are non-overlapping, the file groups can be 
re-ordered to satisfy the required output ordering. 
   
   Yes, that is correct -- the parquet reader produces the streams for the 
files in each partition back to back, so if a partition contains multiple files, 
the resulting stream is not necessarily ordered, even if every input file was.
   
   
   > We would be partitioning a poset of files into a series of chains, where A 
< B if they are non-overlapping, and every row in A goes before every row in B. 
   
   Indeed, as long as the files within each output group were ordered and 
non-overlapping in time, the parquet reader would not need to be changed at all.
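
   To make the chain idea concrete, here is a minimal sketch in Python (not 
DataFusion code). It assumes each file exposes min/max statistics for the sort 
key. `FileRange` and `group_into_chains` are hypothetical names used only for 
illustration. The greedy pass sorts files by minimum key and appends each file 
to a chain whose last file ends strictly before it starts, opening a new chain 
only when no such chain exists.

```python
import heapq
from typing import List, NamedTuple, Tuple


class FileRange(NamedTuple):
    """Hypothetical per-file statistics: min/max of the sort key."""
    name: str
    min_key: int
    max_key: int


def group_into_chains(files: List[FileRange]) -> List[List[FileRange]]:
    """Partition files into chains of ordered, non-overlapping files.

    Assumption: A may precede B in a chain only if A.max_key < B.min_key
    (strict, so equal boundary keys count as overlapping).
    """
    chains: List[List[FileRange]] = []
    # Min-heap of (max_key of the chain's last file, chain index).
    heap: List[Tuple[int, int]] = []
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        if heap and heap[0][0] < f.min_key:
            # The chain that ends earliest finishes before f starts: extend it.
            _, idx = heapq.heappop(heap)
            chains[idx].append(f)
        else:
            # f overlaps the tail of every existing chain: open a new chain.
            idx = len(chains)
            chains.append([f])
        heapq.heappush(heap, (f.max_key, idx))
    return chains


files = [
    FileRange("a", 0, 10),
    FileRange("b", 5, 15),
    FileRange("c", 11, 20),
    FileRange("d", 16, 30),
]
chains = group_into_chains(files)
chain_names = [[f.name for f in chain] for chain in chains]
```

   Each resulting chain could then serve as one file group: reading its files 
back to back yields an ordered stream, so only a merge across chains would be 
needed rather than a global sort.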
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.