tustvold opened a new pull request #379: URL: https://github.com/apache/arrow-datafusion/pull/379
Closes #362.

Creating this as a draft because it currently builds on top of #377: its tests use a partitioned `SortExec`.

This PR adds a `SortPreservingMergeExec` operator that merges multiple sorted partitions into a single sorted partition.

The main implementation is contained within `SortPreservingMergeStream` and `SortKeyCursor`:

- `SortKeyCursor` provides the ability to compare the sort keys of the next row that could be yielded for each stream, in order to determine which one to yield.
- `SortPreservingMergeStream` maintains a list of `SortKeyCursor` for each stream and builds up a list of sorted indices identifying rows within these cursors. When it reads the last row of a RecordBatch, it fetches another from the input. Once it has accumulated `target_batch_size` row indexes (or exhausted all input streams), it combines the relevant rows from the buffered RecordBatches into a single RecordBatch, drops any cursors it no longer needs, and yields the batch.
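
The merge strategy described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: it assumes each partition is an in-memory `Vec<i64>` that is already sorted ascending, rather than a stream of RecordBatches, and it picks the next row with a linear scan over the cursors. The names `Cursor` and `merge_sorted` are hypothetical and exist only for this illustration.

```rust
/// Hypothetical stand-in for a sort-key cursor: it tracks the next row
/// that could be yielded from one sorted input partition.
struct Cursor {
    rows: Vec<i64>,
    pos: usize,
}

impl Cursor {
    /// Sort key of the next unread row, or `None` once the input is exhausted.
    fn peek(&self) -> Option<i64> {
        self.rows.get(self.pos).copied()
    }
}

/// Merge already-sorted partitions into sorted output batches holding at
/// most `target_batch_size` rows each (assumption: ascending order).
fn merge_sorted(partitions: Vec<Vec<i64>>, target_batch_size: usize) -> Vec<Vec<i64>> {
    assert!(target_batch_size > 0, "batch size must be positive");

    let mut cursors: Vec<Cursor> = partitions
        .into_iter()
        .map(|rows| Cursor { rows, pos: 0 })
        .collect();
    let mut batches: Vec<Vec<i64>> = Vec::new();
    let mut current: Vec<i64> = Vec::with_capacity(target_batch_size);

    loop {
        // Find the cursor whose next row has the smallest key.
        // Ties go to the partition with the lowest index.
        let mut best: Option<(usize, i64)> = None;
        for (i, cursor) in cursors.iter().enumerate() {
            if let Some(key) = cursor.peek() {
                if best.map_or(true, |(_, smallest)| key < smallest) {
                    best = Some((i, key));
                }
            }
        }

        match best {
            Some((i, key)) => {
                // Yield the row and advance the cursor it came from.
                cursors[i].pos += 1;
                current.push(key);
                // Emit a batch once enough rows have accumulated.
                if current.len() == target_batch_size {
                    let full = std::mem::replace(
                        &mut current,
                        Vec::with_capacity(target_batch_size),
                    );
                    batches.push(full);
                }
            }
            // Every input is exhausted.
            None => break,
        }
    }

    // Flush the final, possibly short, batch.
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Under these assumptions, merging the partitions `[1, 4, 7]`, `[2, 5, 8]` and `[3, 6, 9]` with a batch size of 4 yields the batches `[1, 2, 3, 4]`, `[5, 6, 7, 8]` and `[9]`. The sketch copies values directly instead of recording row indices and assembling the output batch afterwards, which is the part the PR description handles with buffered RecordBatches.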
