mbutrovich commented on issue #9110:
URL: https://github.com/apache/arrow-rs/issues/9110#issuecomment-4594337839

   I'd like to put forward #9683 as a requested item for this release.
   
   It adds an MSD radix sort kernel (`arrow_row::radix::radix_sort_to_indices`) 
that sorts directly on row-encoded keys, taking advantage of the row format 
already being big-endian and memcmp-comparable. It consistently beats 
`lexsort_to_indices` for multi-column sorts (up to ~2.5x at 32K rows), and the 
docs are upfront about where it loses.
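
For readers unfamiliar with the technique, here is a minimal, std-only sketch of an MSD radix sort that returns indices over memcmp-comparable byte keys. It is not the code from #9683: the function names, the `Vec<u8>` key representation, and the `FALLBACK_THRESHOLD` value are all assumptions made for illustration.

```rust
// Illustrative sketch only; names and threshold are hypothetical, not the PR's API.

/// Bucket sizes at or below this fall back to a comparison sort (assumed value).
const FALLBACK_THRESHOLD: usize = 16;

/// Returns the permutation that puts `keys` in ascending bytewise (memcmp) order.
/// Equal keys keep their original relative order.
fn msd_radix_sort_to_indices(keys: &[Vec<u8>]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    let mut scratch = vec![0usize; keys.len()];
    sort_bucket(keys, &mut indices, &mut scratch, 0);
    indices
}

/// Sorts `indices`, all of whose keys share the same first `depth` bytes.
/// `scratch` must have the same length as `indices`.
fn sort_bucket(keys: &[Vec<u8>], indices: &mut [usize], scratch: &mut [usize], depth: usize) {
    if indices.len() <= 1 {
        return;
    }
    if indices.len() <= FALLBACK_THRESHOLD {
        // Small bucket: stable comparison sort on the bytes not yet examined.
        indices.sort_by(|&a, &b| keys[a][depth..].cmp(&keys[b][depth..]));
        return;
    }

    // Bucket 0 holds keys that end at `depth` (shorter keys sort first);
    // buckets 1..=256 hold keys whose byte at `depth` is `bucket - 1`.
    let bucket_of = |i: usize| keys[i].get(depth).map_or(0, |&b| b as usize + 1);

    // Counting pass.
    let mut counts = [0usize; 257];
    for &i in indices.iter() {
        counts[bucket_of(i)] += 1;
    }

    // Prefix sums give the start offset of each bucket.
    let mut offsets = [0usize; 257];
    let mut sum = 0;
    for b in 0..257 {
        offsets[b] = sum;
        sum += counts[b];
    }

    // Stable distribution pass into the scratch buffer, then copy back.
    let mut next = offsets;
    for &i in indices.iter() {
        let b = bucket_of(i);
        scratch[next[b]] = i;
        next[b] += 1;
    }
    indices.copy_from_slice(scratch);

    // Recurse on the next byte. Bucket 0 contains only identical keys, so skip it.
    for b in 1..257 {
        let (start, end) = (offsets[b], offsets[b] + counts[b]);
        sort_bucket(keys, &mut indices[start..end], &mut scratch[start..end], depth + 1);
    }
}
```

Because the keys are compared most-significant byte first, a big-endian, memcmp-comparable encoding is what allows a byte-at-a-time bucket pass to agree with the full comparison order.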
   
   Worth landing because it's:
   
   - **Correct** — 17 tests covering sort-option combinations, nulls, NaN/Inf, 
threshold boundaries, a fuzz test, and cross-validation against comparison sort 
on the same `Rows`.
   - **Well documented** — clear guidance on when to use it vs 
`lexsort_to_indices`, backed by benchmarks.
   - **Foundational** — operates on `Rows`, returns indices, composes with 
`take`. A building block for downstream sorting work, not a full sort-path 
replacement.
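
As a rough illustration of the "operates on encoded rows, returns indices, composes with `take`" shape, here is a std-only sketch. The `encode_row`, `sort_indices_by_encoded`, and `gather` helpers are hypothetical stand-ins: the encoding is a simplified one (no null handling or sort options, which the real row format covers), the sort is a plain comparison sort over the encoded bytes, and `gather` only mimics what a `take` kernel does.

```rust
// Illustrative sketch only; these helpers are hypothetical and not part of arrow-rs.

/// Encodes an `(i32, u16)` row so that bytewise comparison of the result
/// matches comparison of the original tuple.
fn encode_row(a: i32, b: u16) -> [u8; 6] {
    let mut out = [0u8; 6];
    // Flip the sign bit so negative values order before non-negative ones,
    // then write big-endian so the most significant byte comes first.
    out[..4].copy_from_slice(&((a as u32) ^ 0x8000_0000).to_be_bytes());
    out[4..].copy_from_slice(&b.to_be_bytes());
    out
}

/// Sorts two columns lexicographically by comparing only the encoded bytes,
/// returning indices rather than reordered data.
fn sort_indices_by_encoded(col_a: &[i32], col_b: &[u16]) -> Vec<usize> {
    let keys: Vec<[u8; 6]> = col_a
        .iter()
        .zip(col_b)
        .map(|(&a, &b)| encode_row(a, b))
        .collect();
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by(|&x, &y| keys[x].cmp(&keys[y]));
    indices
}

/// Stand-in for a `take` kernel: materializes `values` in the order given by `indices`.
fn gather<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}
```

Returning indices keeps the sort independent of how many payload columns are reordered afterwards: each column is gathered once with the same permutation.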
   
   There's already downstream interest in building on it. I have a series of 
(now-closed) experimental DataFusion PRs exploring integration into 
`ExternalSorter`:
   
   - apache/datafusion#21525 — bring over the kernel, integrate into 
`ExternalSorter`
   - apache/datafusion#21600 — chunked sort pipeline + radix kernel
   - apache/datafusion#21629 — coalesce batches before sorting to reduce merge 
fan-in
   - apache/datafusion#21688 — further `ExternalSorter` refactor
   
   These showed the kernel's utility but also that getting a real win needs a 
careful redesign of the sort stream rather than a drop-in swap. Landing the 
kernel in a release would give DataFusion (and other downstream consumers) 
something stable to build that work against.
   
   CC @Dandandan
   

