tustvold opened a new pull request #379:
URL: https://github.com/apache/arrow-datafusion/pull/379


   Closes #362.
   
   Opened as a draft because it currently builds on top of #377: its tests use a 
partitioned SortExec.
   
   This PR adds a SortPreservingMergeExec operator that merges multiple sorted 
partitions into a single sorted partition.
   
   The main implementation is contained within `SortPreservingMergeStream` and 
`SortKeyCursor`:
   
   `SortKeyCursor` provides the ability to compare the sort keys of the next 
row that could be yielded for each stream, in order to determine which one to 
yield.
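
   As a rough illustration of the comparison described above, here is a minimal 
sketch in plain Rust. It is not the code from this PR: `KeyCursor`, `peek` and 
`next_stream` are hypothetical names, the sort key is assumed to be a single 
ascending `i64` column, and ties are assumed to go to the lowest stream index.

```rust
/// Hypothetical stand-in for a cursor over one sorted input: it holds the
/// sort-key values of a buffered batch and the index of the next row to yield.
struct KeyCursor {
    keys: Vec<i64>,
    next: usize,
}

impl KeyCursor {
    /// Sort key of the next row, or `None` once the batch is exhausted.
    fn peek(&self) -> Option<i64> {
        self.keys.get(self.next).copied()
    }
}

/// Index of the stream whose next row has the smallest key (ascending order).
/// Exhausted cursors are skipped; ties go to the lowest stream index because
/// the `(key, index)` tuples are compared lexicographically.
fn next_stream(cursors: &[KeyCursor]) -> Option<usize> {
    cursors
        .iter()
        .enumerate()
        .filter_map(|(i, c)| c.peek().map(|k| (k, i)))
        .min()
        .map(|(_, i)| i)
}
```

   A real cursor would have to compare multi-column keys and honour per-column 
sort options such as descending order and null placement, which this sketch 
leaves out.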
   
   `SortPreservingMergeStream` maintains a list of `SortKeyCursor` for each 
stream and builds up a list of sorted indices identifying rows within these 
cursors. When it reads the last row of a RecordBatch, it fetches another from 
the input. Once it has accumulated `target_batch_size` row indices (or exhausted 
all input streams) it will combine the relevant rows from the buffered 
RecordBatches into a single RecordBatch, drop any cursors it no longer needs, 
and yield the batch.
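
   The batching behaviour described above can be sketched with the standard 
library alone. This is a simplified model, not the PR's implementation: batches 
are plain `Vec<i64>` values instead of RecordBatches, `InputCursor` and 
`merge_sorted` are hypothetical names, there is a single cursor per input, and 
rows are copied into the output as they are chosen rather than recorded as 
indices and combined afterwards.

```rust
use std::collections::VecDeque;

/// Hypothetical model of one sorted input and the batch currently being read.
struct InputCursor {
    batches: VecDeque<Vec<i64>>, // batches not yet fetched from this input
    current: Vec<i64>,           // batch currently being read
    pos: usize,                  // next unread row in `current`
}

impl InputCursor {
    fn new(batches: Vec<Vec<i64>>) -> Self {
        let mut c = InputCursor { batches: batches.into(), current: Vec::new(), pos: 0 };
        c.refill();
        c
    }

    /// Fetch batches until one has an unread row or the input is exhausted.
    fn refill(&mut self) {
        while self.pos >= self.current.len() {
            match self.batches.pop_front() {
                Some(b) => {
                    self.current = b;
                    self.pos = 0;
                }
                None => break,
            }
        }
    }

    /// Key of the next row, or `None` if this input is exhausted.
    fn peek(&self) -> Option<i64> {
        self.current.get(self.pos).copied()
    }

    /// Yield the next row; after the last row of a batch, fetch another.
    fn advance(&mut self) -> i64 {
        let v = self.current[self.pos];
        self.pos += 1;
        self.refill();
        v
    }
}

/// Merge sorted inputs into output batches of at most `target_batch_size` rows.
fn merge_sorted(inputs: Vec<Vec<Vec<i64>>>, target_batch_size: usize) -> Vec<Vec<i64>> {
    assert!(target_batch_size > 0);
    let mut cursors: Vec<InputCursor> = inputs.into_iter().map(InputCursor::new).collect();
    let mut output = Vec::new();
    let mut batch = Vec::with_capacity(target_batch_size);
    loop {
        // Pick the input whose next row has the smallest key.
        let next = cursors
            .iter()
            .enumerate()
            .filter_map(|(i, c)| c.peek().map(|k| (k, i)))
            .min()
            .map(|(_, i)| i);
        match next {
            Some(i) => batch.push(cursors[i].advance()),
            None => break, // every input is exhausted
        }
        // Yield a full batch and start accumulating the next one.
        if batch.len() == target_batch_size {
            output.push(std::mem::replace(&mut batch, Vec::with_capacity(target_batch_size)));
        }
    }
    if !batch.is_empty() {
        output.push(batch); // final, possibly short, batch
    }
    output
}
```

   For example, calling `merge_sorted` on the inputs `[[1, 4], [7]]` and 
`[[2, 3, 8]]` with a target batch size of 4 produces the batches `[1, 2, 3, 4]` 
and `[7, 8]`. The linear scan over cursors keeps the sketch short; a binary 
heap is a common alternative when there are many inputs.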
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

