Dandandan commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2804304348

   I think `concat` followed by `sort` is slower because
   
   * Concat involves copying the entire batch (rather than only the keys to be 
sorted)
   * `sort_batch_stream` can be slower, because with many sort columns 
`lexsort_to_indices` is slower than the Row Format
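
To illustrate the second point, here is a minimal std-only sketch of the two comparison strategies. It is a toy model, not the arrow-rs implementation: the function names `lexsort_indices` and `row_encoded_indices` are hypothetical, and all columns are assumed to be non-null `u32`. The first function walks the sort columns on every comparison, the second encodes each row into a single byte string once and then compares bytes.

```rust
use std::cmp::Ordering;

/// Column-at-a-time comparison, in the spirit of `lexsort_to_indices`:
/// every comparison walks the sort columns until one differs.
/// (Hypothetical helper, not the arrow-rs function.)
fn lexsort_indices(columns: &[Vec<u32>]) -> Vec<usize> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut indices: Vec<usize> = (0..num_rows).collect();
    indices.sort_by(|&a, &b| {
        for col in columns {
            match col[a].cmp(&col[b]) {
                Ordering::Equal => continue,
                other => return other,
            }
        }
        Ordering::Equal
    });
    indices
}

/// Row-encoded comparison, in the spirit of the Row Format:
/// each row is encoded once into a byte string (big-endian, so that
/// byte order matches numeric order) and the sort compares those bytes.
/// (Hypothetical helper, not the arrow-rs `RowConverter`.)
fn row_encoded_indices(columns: &[Vec<u32>]) -> Vec<usize> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let rows: Vec<Vec<u8>> = (0..num_rows)
        .map(|r| columns.iter().flat_map(|c| c[r].to_be_bytes()).collect())
        .collect();
    let mut indices: Vec<usize> = (0..num_rows).collect();
    indices.sort_by(|&a, &b| rows[a].cmp(&rows[b]));
    indices
}
```

Both functions return the same permutation. The difference is where the per-column work happens: inside every comparison in the first case, once per row up front in the second.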
   
   I think for `ExternalSorter` we don't want any additional parallelism as the 
sort is already executed per partition.
   
   The core improvements are:
   
   * copying the input batches only once (to produce the output)
   * sorting the input batches directly rather than sorting followed by 
merging
   * a good heuristic for when to switch from `lexsort_to_indices`-like 
sorting to `RowConverter` + sorting.
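
The first two points can be sketched roughly as follows. This is a std-only toy model under stated assumptions, not the code in this PR: `Batch` and `sort_batches` are hypothetical names, there is a single `u32` sort key, and the switching heuristic from the third point is not modeled. Only the keys and their `(batch, row)` positions are sorted, and the payload is copied once when the output is built.

```rust
/// Hypothetical stand-in for a record batch: one sort key column
/// plus one payload column that is expensive to copy.
struct Batch {
    keys: Vec<u32>,
    payload: Vec<String>,
}

/// Sort rows across all input batches without concatenating them first.
fn sort_batches(batches: &[Batch]) -> Batch {
    // Gather only the sort keys, tagged with where each row lives.
    let mut entries: Vec<(u32, usize, usize)> = Vec::new();
    for (batch_idx, batch) in batches.iter().enumerate() {
        for (row_idx, &key) in batch.keys.iter().enumerate() {
            entries.push((key, batch_idx, row_idx));
        }
    }

    // Sort the small key entries; ties fall back to (batch, row) order.
    entries.sort();

    // The only copy of the payload: gather rows directly into the output.
    let mut out = Batch {
        keys: Vec::with_capacity(entries.len()),
        payload: Vec::with_capacity(entries.len()),
    };
    for (key, batch_idx, row_idx) in entries {
        out.keys.push(key);
        out.payload.push(batches[batch_idx].payload[row_idx].clone());
    }
    out
}
```

Compared with `concat` followed by `sort`, the payload columns are touched once instead of twice, and there is no separate merge step over per-batch sorted runs.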


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

