[GitHub] [arrow-datafusion] tustvold opened a new issue, #5230: Use Arrow Row Format in SortExec

via GitHub Thu, 09 Feb 2023 10:53:35 -0800


tustvold opened a new issue, #5230:
URL: https://github.com/apache/arrow-datafusion/issues/5230


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `SortPreservingMerge` now makes use of the [arrow row 
format](https://docs.rs/arrow-row/latest/arrow_row/) and this has yielded 
significant performance improvements over the prior DynComparator based 
approach. We can likely signifcantly improve the performance of `SortExec` by 
modifying `sort_batch` to make use of the row format when performing 
multi-column sorts instead of `lexsort_to_indices`, which internally uses 
DynComparator.
   
   For single-column sorts `lexsort_to_indices` will call through to 
`sort_to_indices` which will be faster than the row format, we should make sure 
to keep this special case.
   
   **Describe the solution you'd like**
   
   A first iteration could simply modify `sort_batch` to use the row format for 
multi-column sorts, as demonstrated 
[here](https://docs.rs/arrow-row/latest/arrow_row/#lexsort), falling back to 
`sort_to_indices` if only a single column.
   
   A second iteration could then look to find a way to convert to the row 
format once, and preserve this encoding when feeding sorted batches into 
`SortPreservingMerge`.
   
   **Describe alternatives you've considered**
   We could not do this
   
   **Additional context**
   
   FYI @alamb @mustafasrepo @ozankabak 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] tustvold opened a new issue, #5230: Use Arrow Row Format in SortExec

Reply via email to