alamb commented on issue #6672:
URL: https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1608113109

> I have had a somewhat overlapping (no pun intended) issue where DataFusion abandons the SortPreservingMergeStream and does a global sort if there are multiple files in any file groups. It should be possible for DataFusion to realize that, if the files are non-overlapping, the file groups can be re-ordered to satisfy the required output ordering.

Yes, that is correct -- each partition stream from the parquet reader is produced back to back, so if there are multiple files, the resulting stream is not ordered even if all the input files were.

> We would be partitioning a poset of files into a series of chains, where A < B if they are non-overlapping, and every row in A goes before every row in B.

Indeed, as long as each output group was ordered in non-overlapping time, the parquet reader would not need to be changed at all.
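
To make the chain idea concrete, here is a minimal sketch of one way such a grouping could be computed. It is not DataFusion code: `FileRange` and `split_into_chains` are hypothetical names, and the sketch assumes each file exposes min/max statistics for a single `i64` sort key. Files are visited in order of their minimum value and appended to the first group whose last file ends strictly before the new file begins.

```rust
/// Hypothetical stand-in for a file's min/max statistics on the sort key.
/// (Not a DataFusion type; assumed here for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Greedily split files into chains (file groups). Within each chain the
/// files are in key order and do not overlap, so reading them back to back
/// yields an ordered stream.
fn split_into_chains(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Visit files in order of their smallest key (ties broken by largest key).
    files.sort_by_key(|f| (f.min, f.max));

    let mut chains: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // Find the first chain whose last file ends strictly before this
        // file starts, i.e. last < file in the partial order.
        let slot = chains
            .iter()
            .position(|chain| chain.last().map_or(false, |last| last.max < file.min));

        match slot {
            Some(i) => chains[i].push(file),
            // No compatible chain: this file overlaps all of them, so it
            // starts a new chain.
            None => chains.push(vec![file]),
        }
    }
    chains
}
```

For example, files covering `[0, 10]`, `[5, 15]`, `[11, 20]`, and `[16, 30]` would end up in two groups under this sketch: `[0, 10]` followed by `[11, 20]`, and `[5, 15]` followed by `[16, 30]`. Each group could then be fed to the merge as one ordered partition. The strict `<` comparison is a conservative choice; whether files that share a boundary value may be chained depends on how ties in the sort key should be handled.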
