neilconway opened a new pull request, #21083:
URL: https://github.com/apache/datafusion/pull/21083

   ## Which issue does this PR close?
   
   - Closes #21005.
   - Closes #21041.
   
   ## Rationale for this change
   
   The previous `array_sort` implementation called the Arrow sort kernel for 
every row, and then used `concat` to produce the final results. This was quite 
inefficient. Instead, we employee three different techniques depending on the 
input:
   
   (1) For arrays of primitives types without null elements, we copy all values 
into a single `Vec`, sort each row's slice of the `Vec` in-place, and then wrap 
the `Vec` in a `GenericListArray`.
   
   (2) For arrays of primitives types with null elements, we use a similar 
approach but we need to incur some more bookkeeping to place null elements in 
the right place and construct the null buffer.
   
   (3) For arrays of non-primitive types, we use `RowConverter` to convert the 
entire input into the row format in one call, sort row indices by comparing the 
encoded row values, and then use a single `take()` to construct the result of 
the sort.
   
   Benchmarks (8192 rows, vs main):
   
       int32/5 elements:          886 µs →  57 µs  (-94%)
       int32/20 elements:        1.64 ms → 846 µs  (-48%)
       int32/100 elements:       4.03 ms → 3.22 ms (-20%)
       int32_null_elements/5:    1.17 ms → 168 µs  (-86%)
       int32_null_elements/1000: 47.2 ms → 44.1 ms  (-7%)
       string/5 elements:        2.12 ms → 727 µs  (-66%)
       string/1000 elements:      405 ms → 293 ms  (-28%)
   
   ## What changes are included in this PR?
   
   * New `array_sort` benchmark
   * Extended unit test coverage
   * Improve docs
   * Implement optimizations as described above
   
   ## Are these changes tested?
   
   No.
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to