yjshen commented on pull request #1596:
URL: 
https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1015308440


   The logic for ExternalSort:
   1. get batch from input, sort it, and buffer it in memory
   2. when memory threshold meet, do in-mem-sort to do "N-way merge" and spill 
it to file
   3. repeat 1 and 2 until input is exhausted.
   4. another "N-way merge": merge current in-mem buffered batches as well as 
all spills, to get the total order.
   
   Currently, I've unified all "N-way merge" into `SortPreservingMergeStream`. 
Inside SPMS, I construct a min-heap size of "N" to minimize merging memory 
consumption. However, the performance deterioration of item-wise heap compare 
seems to overweight its benefits of memory saving. 
   
   I think it is probably worth porting 
https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/merge_sort/mod.rs
 to avoid too much memory consumption during `combine_batches` in the original 
sort implementation.
   
   cc @alamb @jorgecarleitao @houqp @tustvold.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to