pchintar opened a new pull request, #9991:
URL: https://github.com/apache/arrow-rs/pull/9991

   # Which issue does this PR close?
   
   - Closes #9990 .
   
   # Rationale for this change
   
   `lexsort_to_indices` currently materializes row indices for the full input 
before applying the requested lexsort limit.
   
   For example, with:
   
   ```text
   row_count = 4096
   limit = 10
   ```
   
   the current implementation still allocates and initializes indices for all 
4096 rows:
   
   ```rust
   let mut value_indices = (0..row_count).collect::<Vec<usize>>();
   ```
   
   even though only the first 10 sorted indices are returned.
   
   This PR reduces allocation and sorting work for small-limit lexsorts by 
avoiding full index materialization when the requested limit is a small 
fraction of the input size.
   
   # What changes are included in this PR?
   
   This PR adds a bounded top-k heap path for small-limit lexsorts in 
`arrow-ord/src/sort.rs`.
   
   The new path:
   
   * maintains only the current top-k row indices while scanning the input
   * avoids allocating indices for every row
   * sorts only the retained top-k indices before returning the result
   
   The existing partial-sort implementation remains unchanged for larger limits.
   
   To support the bounded heap implementation, this PR also adds:
   
   - `lexsort_topk_fixed`
     - Uses the existing fixed-column comparator with the bounded heap path.
   
   - `lexsort_topk`
     - Keeps only the current top-k row indices in a bounded max-heap while 
scanning rows.
   
   - `sift_up_worst_heap`
     - Moves a newly inserted row index upward until the heap order is restored.
   
   - `sift_down_worst_heap`
     - Moves the current worst retained row index downward after replacement.
   
   # Are these changes tested?
   
   Yes.
   
   Existing tests pass:
   
   ```text
   cargo test -p arrow-ord
   cargo test -p arrow --features test_utils
   ```
   
   My local Benchmark results from:
   
   ```text
   cargo bench --package arrow --bench sort_kernel --features test_utils -- 
"limit"
   ```
   
   show improvements for small-limit lexsorts such as limits 10, 100.
   
   # Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to