neilconway opened a new pull request, #21083:
URL: https://github.com/apache/datafusion/pull/21083
## Which issue does this PR close?
- Closes #21005.
- Closes #21041.
## Rationale for this change
The previous `array_sort` implementation called the Arrow sort kernel for
every row, and then used `concat` to produce the final results. This was quite
inefficient. Instead, we employee three different techniques depending on the
input:
(1) For arrays of primitives types without null elements, we copy all values
into a single `Vec`, sort each row's slice of the `Vec` in-place, and then wrap
the `Vec` in a `GenericListArray`.
(2) For arrays of primitives types with null elements, we use a similar
approach but we need to incur some more bookkeeping to place null elements in
the right place and construct the null buffer.
(3) For arrays of non-primitive types, we use `RowConverter` to convert the
entire input into the row format in one call, sort row indices by comparing the
encoded row values, and then use a single `take()` to construct the result of
the sort.
Benchmarks (8192 rows, vs main):
int32/5 elements: 886 µs → 57 µs (-94%)
int32/20 elements: 1.64 ms → 846 µs (-48%)
int32/100 elements: 4.03 ms → 3.22 ms (-20%)
int32_null_elements/5: 1.17 ms → 168 µs (-86%)
int32_null_elements/1000: 47.2 ms → 44.1 ms (-7%)
string/5 elements: 2.12 ms → 727 µs (-66%)
string/1000 elements: 405 ms → 293 ms (-28%)
## What changes are included in this PR?
* New `array_sort` benchmark
* Extended unit test coverage
* Improve docs
* Implement optimizations as described above
## Are these changes tested?
No.
## Are there any user-facing changes?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]