alamb opened a new issue, #10316: URL: https://github.com/apache/datafusion/issues/10316
### Is your feature request related to a problem or challenge? When merging a large number of pre-sorted streams (e.g. in our case, a large number of pre-sorted parquet files) the actual work in `SortPreservingMerge` to keep them sorted is often substantial (as the sort key of each row in each stream must be compared the other potential candidates) Here is the sort preserving merge https://github.com/apache/datafusion/blob/4edbdd7d09d97f361748c086afbd7b3dda972f76/datafusion/physical-plan/src/sorts/sort_preserving_merge.rs#L39-L67 However, in some cases (such as @suremarc has identified in https://github.com/apache/datafusion/issues/6672) we can use information about how the values of the sort key columns are distributed to avoid needing a sort For example, if we have three files that are each sorted by `time` and have the following ranges * file1.paquet `min(time) = 2024-01-01` and `max(time) = 2024-01-31` * file2.paquet `min(time) = 2024-02-01` and `max(time) = 2024-02-28` * file3.paquet `min(time) = 2024-03-01` and `max(time) = 2024-03-31` We can produce the output sorted stream by first reading file1.parquet entirely then file2.parquet, then file3.parquet Not only will this be faster than using `SortPreservingMerge` it will require less intermediate memory as we don't need to read a batch from each input stream to begin producing output. For cases where there may be 100s of files, this can minimize the amount of concurrently outstanding requests substantially Also, for a query that will not read the entire dataset (e.g. only wants the most recent values) it can be especially beneficial: ```sql SELECT * FROM data ORDER BY time limit 10 ``` In this case our example above would only ever read file1.parquet (wouldn't even open the others) if it had more than 10 rows ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context The original inspiration for this operator came from @pauldix (who I think mentioned it was inspired by ElasticSearch) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
