yjshen commented on issue #1572:
URL: 
https://github.com/apache/arrow-datafusion/issues/1572#issuecomment-1013680326


   Thanks @alamb for bringing it up.
   
   I propose using heap-sort for N-way merge, but consolidate all the codes we 
have now in `in_mem_sort` and `SortPreservingMergeStream(SPMS)` into  SPMS, and 
remove `in_mem_sort`. 
   
   The main reason to use `heap_sort` is to minimize the number of comparisons 
before emitting each record. Since we have "N" partial ordered parts already, 
we only need to maintain a heap of size "N" at a time. And benefit from the 
complexity of O(1) to take out of the smallest record, and O(log n) for new 
item insertion if the last smallest stream or record batch is not exhausted. 
   
   For the current sorting methods in SPMS, I think it's to compare all 
starting records from all streams at once for each time we emit the least. The 
comparison might not be an issue when N is relatively small. But would 
deteriorate fast as N grows.
   
   @tustvold, what's your opinion on the merging algorithm to choose?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to