yjshen edited a comment on pull request #1596:
URL: 
https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1015308440


   The logic for ExternalSort:
   1. get batch from input, sort it, and buffer it in memory
   2. when memory threshold meet,  do "N-way merge" and spill the results to 
file
   3. repeat 1 and 2 until input is exhausted.
   4. another "N-way merge": merge current in-mem buffered batches as well as 
all spills, to get the total order.
   
   Currently, I've unified all "N-way merge" into `SortPreservingMergeStream`. 
Inside SPMS, I construct a min-heap size of "N" to minimize merging memory 
consumption. However, the performance deterioration of item-wise heap compare 
seems to overweight its benefits of memory saving according to the TPC-H 
lineitem sort results.
   
   The original sort algorithm is to `combine_batches` first and then sort the 
single batch. However, this suffers too much extra memory usage for "combine" 
first, which **doubles** the memory usage I think is not acceptable. 
   
   Therefore, I think it is probably worth trying 
https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/merge_sort/mod.rs
 or bringing a new memory-efficient sort without combining first.
   
   cc @alamb @jorgecarleitao @houqp @tustvold.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to