yjshen commented on pull request #1596: URL: https://github.com/apache/arrow-datafusion/pull/1596#issuecomment-1015308440
The logic for ExternalSort: 1. get batch from input, sort it, and buffer it in memory 2. when memory threshold meet, do in-mem-sort to do "N-way merge" and spill it to file 3. repeat 1 and 2 until input is exhausted. 4. another "N-way merge": merge current in-mem buffered batches as well as all spills, to get the total order. Currently, I've unified all "N-way merge" into `SortPreservingMergeStream`. Inside SPMS, I construct a min-heap size of "N" to minimize merging memory consumption. However, the performance deterioration of item-wise heap compare seems to overweight its benefits of memory saving. I think it is probably worth porting https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/merge_sort/mod.rs to avoid too much memory consumption during `combine_batches` in the original sort implementation. cc @alamb @jorgecarleitao @houqp @tustvold. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
