pchintar opened a new issue, #9990:
URL: https://github.com/apache/arrow-rs/issues/9990

   ## Description
   
   Currently, `lexsort_to_indices` materializes row indices for the full input 
before applying the requested lexsort limit.
   
   For example, with:
   
   ```text
   row_count = 4096
   limit = 10
   ````
   
   the implementation still creates:
   
   ```rust
   let mut value_indices = (0..row_count).collect::<Vec<usize>>();
   ```
   
   so all 4096 row indices are allocated and initialized even though only 10 
sorted indices are returned.
   
   ## Root Cause
   
   In `arrow-ord/src/sort.rs`, `lexsort_to_indices` first builds an index 
vector for every row:
   
   ```rust
   let mut value_indices = (0..row_count).collect::<Vec<usize>>();
   let mut len = value_indices.len();
   
   if let Some(limit) = limit {
       len = limit.min(len);
   }
   ```
   
   The limit only changes `len`, which is then passed to the sort helper:
   
   ```rust
   sort_fixed_column::<2>(columns, &mut value_indices, len)?;
   ```
   
   or, for the generic path:
   
   ```rust
   sort_unstable_by(&mut value_indices, len, |a, b| {
       lexicographical_comparator.compare(*a, *b)
   });
   ```
   
   The result is finally built from the sorted prefix:
   
   ```rust
   value_indices[..len]
       .iter()
       .map(|i| *i as u32)
       .collect::<Vec<_>>()
   ```
   
   As a result, bounded lexsorts still allocate and initialize indices for rows 
outside the returned prefix.
   
   ## Current Behavior
   
   For bounded lexsorts:
   
   ```text
   all rows
       -> allocate and initialize one index per row
       -> sort only the requested prefix
       -> return the requested prefix
   ```
   
   This is correct, but it does unnecessary index allocation for small limits.
   
   ## Implications
   
   Small-limit lexsorts spend time and memory creating indices that cannot 
appear in the returned result unless they are selected into the requested 
prefix.
   
   This adds avoidable work for cases such as:
   
   ```text
   lexsort ... limit 10
   lexsort ... limit 100
   ```
   
   where the requested output is much smaller than the input.
   
   ## Proposed Solution
   
   For sufficiently small limits, maintain only the current top-k row indices 
while scanning the input, instead of first materializing every row index.
   
   The existing partial-sort path should remain in use for larger limits, where 
maintaining a bounded heap is not beneficial.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to