Dandandan commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2804304348
I think `concat` followed by `sort` is slower because * Concat involves copying the entire batch (rather than only the keys to be sorted) * `sort_batch_stream` Can be slower as `lexsort_to_indices` is in cases with many columns slower than the Row Format I think for `ExternalSorter` we don't want any additional parallelism as the sort is already executed per partition. The core improvement: * minimizing copying of the input batches to one (only for the output) * sorting on the input batches rather than sort followed by merge * A good heuristic on when to switch from `lexsort_to_indices`-like sorting to RowConverter + sorting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org