yjshen edited a comment on pull request #1596: URL: https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1015308440
The logic for ExternalSort: 1. get batch from input, sort it, and buffer it in memory 2. when memory threshold meet, do "N-way merge" and spill the results to file 3. repeat 1 and 2 until input is exhausted. 4. another "N-way merge": merge current in-mem buffered batches as well as all spills, to get the total order. Currently, I've unified all "N-way merge" into `SortPreservingMergeStream`. Inside SPMS, I construct a min-heap size of "N" to minimize merging memory consumption. However, the performance deterioration of item-wise heap compare seems to overweight its benefits of memory saving according to the TPC-H lineitem sort results. The original sort algorithm is to `combine_batches` first and then sort the single batch. However, this suffers too much extra memory usage for "combine" first, which **doubles** the memory usage I think is not acceptable. Therefore, I think it is probably worth trying https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/merge_sort/mod.rs or bringing a new memory-efficient sort without combining first. cc @alamb @jorgecarleitao @houqp @tustvold. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org