zhuqi-lucas opened a new issue, #21381: URL: https://github.com/apache/datafusion/issues/21381
**Is your feature request related to a problem or challenge?** When `SortExec` is eliminated via sort pushdown (statistics-based file reordering), `SortPreservingMergeExec` reads directly from I/O-bound sources instead of from `SortExec`'s in-memory buffer. Currently SPM does a single K-way merge of all input streams, which can become a bottleneck when there are many partitions. **Describe the solution you'd like** Implement parallel merge for `SortPreservingMergeExec`: split the N input streams into groups, merge each group in parallel, then merge the intermediate results. This creates a tree of merges instead of a single flat K-way merge. For example, with 8 input streams: ``` Level 1 (parallel): merge(s1,s2), merge(s3,s4), merge(s5,s6), merge(s7,s8) Level 2 (parallel): merge(m1,m2), merge(m3,m4) Level 3: merge(m5,m6) → final output ``` This would be especially beneficial when: - Sort elimination removes the buffering `SortExec`, making SPM I/O-bound - Many partitions with I/O-bound sources - Large datasets where merge computation itself becomes a bottleneck **Additional context** Suggested by @Dandandan in https://github.com/apache/datafusion/pull/21182#discussion_r3036542606: > Also slightly looking forward: I think we could benefit from _parallel merge_ (e.g. finding some split in the n streams and merging in parallel) in these situations where sorting becomes mostly about merging. Related: DuckDB implements a similar [parallel merge sort](https://duckdb.org/2021/08/27/external-sorting.html) strategy. Parent issue: https://github.com/apache/datafusion/issues/17348 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
