mbutrovich opened a new pull request, #21525:
URL: https://github.com/apache/datafusion/pull/21525

   ## Which issue does this PR close?
   
   N/A.
   
   ## Rationale for this change
   
   MSD radix sort on row-encoded keys is 2-3x faster than `lexsort_to_indices` 
for most multi-column sorts. See https://github.com/apache/arrow-rs/pull/9683 
for benchmarks and design rationale.
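
For readers unfamiliar with the technique, the following is a minimal sketch of MSD radix sort over byte-string keys. It is not the kernel from apache/arrow-rs#9683. It assumes each key is already a byte string whose lexicographic byte order matches the desired sort order, which is the property a row encoding is meant to provide. The names `msd_radix_sort_indices`, `sort_range`, and `FALLBACK_THRESHOLD` are hypothetical, and the threshold value is arbitrary.

```rust
/// Hypothetical cutoff below which a comparison sort is used instead.
const FALLBACK_THRESHOLD: usize = 32;

/// Returns the permutation of indices that sorts `keys` lexicographically.
/// Equal keys keep their original relative order (the sort is stable).
pub fn msd_radix_sort_indices(keys: &[Vec<u8>]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    let mut scratch = vec![0usize; keys.len()];
    sort_range(keys, &mut indices, &mut scratch, 0);
    indices
}

/// Sorts `indices` by the bytes of their keys starting at `depth`.
/// Invariant: every key in this range shares its first `depth` bytes
/// and is at least `depth` bytes long.
fn sort_range(keys: &[Vec<u8>], indices: &mut [usize], scratch: &mut [usize], depth: usize) {
    if indices.len() <= 1 {
        return;
    }
    if indices.len() <= FALLBACK_THRESHOLD {
        // Small bucket: a stable comparison sort on the remaining suffix.
        indices.sort_by(|&a, &b| keys[a][depth..].cmp(&keys[b][depth..]));
        return;
    }

    // Bucket 0 holds keys that end at `depth`; bucket b + 1 holds byte value b.
    let bucket = |i: usize| keys[i].get(depth).map_or(0, |&b| b as usize + 1);

    // Count the keys per bucket, then turn counts into starting offsets.
    let mut offsets = [0usize; 258];
    for &i in indices.iter() {
        offsets[bucket(i) + 1] += 1;
    }
    for b in 0..257 {
        offsets[b + 1] += offsets[b];
    }

    // Stable distribution pass into the scratch buffer.
    let mut cursor = offsets;
    for &i in indices.iter() {
        let b = bucket(i);
        scratch[cursor[b]] = i;
        cursor[b] += 1;
    }
    indices.copy_from_slice(scratch);

    // Recurse on the next byte. Bucket 0 is skipped: its keys are all equal.
    for b in 1..257 {
        let (start, end) = (offsets[b], offsets[b + 1]);
        if end - start > 1 {
            sort_range(keys, &mut indices[start..end], &mut scratch[start..end], depth + 1);
        }
    }
}
```

Returning indices rather than sorted keys mirrors the shape of `lexsort_to_indices`, so the result can drive a subsequent gather of the original columns.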
   
   The arrow-rs kernel hasn't been released yet, so this PR temporarily hoists 
the implementation into DataFusion. Once arrow-rs ships it, we can replace the 
local copy with an import of the upstream kernel.
   
   ## What changes are included in this PR?
   - **`sorts/radix.rs`** (new): MSD radix sort kernel + `RowConverter` wiring, 
copied from apache/arrow-rs#9683.
   - **`sorts/sort.rs`**: `sort_batch` branches to radix sort when `fetch` is 
not set and column types are favorable. `ExternalSorter` precomputes the 
radix-vs-lexsort decision once at construction rather than per-batch.
   - **`sorts/stream.rs`**: `IncrementalSortIterator` accepts the precomputed 
`use_radix` flag.
   
   The heuristic falls back to `lexsort_to_indices` when:
   - `fetch` is set (partial sort / top-N)
   - All sort columns are dictionary-typed (long shared row prefixes)
   - Any sort column is a nested type (lists, structs), because the encoding 
cost outweighs the benefit
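
The fallback rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the code in this PR: `ColumnType` is a hypothetical, simplified stand-in for Arrow's data types, and `use_radix_sort` and `SorterSketch` are made-up names. The struct shows the "decide once at construction" idea described for `ExternalSorter`.

```rust
/// Hypothetical, simplified stand-in for an Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Primitive,
    Utf8,
    Dictionary,
    List,
    Struct,
}

impl ColumnType {
    /// Nested types are the ones the description lists: lists and structs.
    fn is_nested(self) -> bool {
        matches!(self, ColumnType::List | ColumnType::Struct)
    }
}

/// Returns true when the radix path would be chosen under the three
/// fallback rules listed above.
pub fn use_radix_sort(sort_columns: &[ColumnType], fetch: Option<usize>) -> bool {
    // Rule 1: a fetch limit (partial sort / top-N) falls back to lexsort.
    if fetch.is_some() {
        return false;
    }
    // Rule 2: all columns dictionary-typed falls back to lexsort.
    // Note: `all` is true for an empty slice, so no sort columns also falls back
    // (an assumption of this sketch).
    if sort_columns.iter().all(|t| *t == ColumnType::Dictionary) {
        return false;
    }
    // Rule 3: any nested column falls back to lexsort.
    if sort_columns.iter().any(|t| t.is_nested()) {
        return false;
    }
    true
}

/// Hypothetical sorter that computes the decision once, at construction,
/// and reuses it for every batch.
pub struct SorterSketch {
    pub use_radix: bool,
}

impl SorterSketch {
    pub fn new(sort_columns: &[ColumnType], fetch: Option<usize>) -> Self {
        Self {
            use_radix: use_radix_sort(sort_columns, fetch),
        }
    }
}
```

Because the inputs to the decision (column types and `fetch`) do not change between batches, caching the boolean avoids re-inspecting the schema on every call.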
   
   ## Are these changes tested?
   Four new tests:
   - Multi-column integer sort verifying radix path correctness
   - Heuristic unit test (primitives, dictionaries, lists, mixed)
   - Nulls with descending + nulls_first sort options
   - 50-iteration fuzz test cross-validating radix output against lexsort
   
   The kernel itself has 17 tests in the arrow-rs PR. The 188 existing sort 
tests in `datafusion-physical-plan` continue to pass.
   
   ## Are there any user-facing changes?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

