[I] Optimized version of `SortPreservingMerge` that doesn't actually compare sort keys of the key ranges are ordered [datafusion]

via GitHub Tue, 30 Apr 2024 09:58:03 -0700


alamb opened a new issue, #10316:
URL: https://github.com/apache/datafusion/issues/10316


   ### Is your feature request related to a problem or challenge?
   
   When merging a large number of pre-sorted streams (e.g. in our case, a large 
number of pre-sorted parquet files) the actual work in `SortPreservingMerge` to 
keep them sorted is often substantial  (as the sort key of each row in each 
stream must be compared the other potential candidates)
   
   Here is the sort preserving merge
   
https://github.com/apache/datafusion/blob/4edbdd7d09d97f361748c086afbd7b3dda972f76/datafusion/physical-plan/src/sorts/sort_preserving_merge.rs#L39-L67
   
   However, in some cases (such as @suremarc  has identified in 
https://github.com/apache/datafusion/issues/6672) we can use information about 
how the values of the sort key columns are distributed to avoid needing a sort
   
   For example, if we have three files that are each sorted by `time` and have 
the following ranges
   * file1.paquet `min(time) = 2024-01-01` and `max(time) = 2024-01-31`
   * file2.paquet `min(time) = 2024-02-01` and `max(time) = 2024-02-28`
   * file3.paquet `min(time) = 2024-03-01` and `max(time) = 2024-03-31`
   
   We can produce the output sorted stream by first reading file1.parquet 
entirely then file2.parquet, then file3.parquet
   
   Not only will this be faster than using `SortPreservingMerge` it will 
require less intermediate memory as we don't need to read a batch from each 
input stream to begin producing output. For cases where there may be 100s of 
files, this can minimize the amount of concurrently outstanding requests 
substantially
   
   Also, for a query that will not read the entire dataset (e.g. only wants the 
most recent values) it can be especially beneficial:
   
   ```sql
   SELECT * FROM data ORDER BY time limit 10
   ```
   
   In this case our example above would only ever read file1.parquet (wouldn't 
even open the others) if it had more than 10 rows
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   The original inspiration for this operator came from @pauldix  (who I think 
mentioned it was inspired by ElasticSearch)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Optimized version of `SortPreservingMerge` that doesn't actually compare sort keys of the key ranges are ordered [datafusion]

Reply via email to