pchintar opened a new issue, #9990:
URL: https://github.com/apache/arrow-rs/issues/9990
## Description
Currently, `lexsort_to_indices` materializes row indices for the full input
before applying the requested lexsort limit.
For example, with:
```text
row_count = 4096
limit = 10
````
the implementation still creates:
```rust
let mut value_indices = (0..row_count).collect::<Vec<usize>>();
```
so all 4096 row indices are allocated and initialized even though only 10
sorted indices are returned.
## Root Cause
In `arrow-ord/src/sort.rs`, `lexsort_to_indices` first builds an index
vector for every row:
```rust
let mut value_indices = (0..row_count).collect::<Vec<usize>>();
let mut len = value_indices.len();
if let Some(limit) = limit {
len = limit.min(len);
}
```
The limit only changes `len`, which is then passed to the sort helper:
```rust
sort_fixed_column::<2>(columns, &mut value_indices, len)?;
```
or, for the generic path:
```rust
sort_unstable_by(&mut value_indices, len, |a, b| {
lexicographical_comparator.compare(*a, *b)
});
```
The result is finally built from the sorted prefix:
```rust
value_indices[..len]
.iter()
.map(|i| *i as u32)
.collect::<Vec<_>>()
```
As a result, bounded lexsorts still allocate and initialize indices for rows
outside the returned prefix.
## Current Behavior
For bounded lexsorts:
```text
all rows
-> allocate and initialize one index per row
-> sort only the requested prefix
-> return the requested prefix
```
This is correct, but it does unnecessary index allocation for small limits.
## Implications
Small-limit lexsorts spend time and memory creating indices that cannot
appear in the returned result unless they are selected into the requested
prefix.
This adds avoidable work for cases such as:
```text
lexsort ... limit 10
lexsort ... limit 100
```
where the requested output is much smaller than the input.
## Proposed Solution
For sufficiently small limits, maintain only the current top-k row indices
while scanning the input, instead of first materializing every row index.
The existing partial-sort path should remain in use for larger limits, where
maintaining a bounded heap is not beneficial.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]