jorgecarleitao opened a new pull request #8853:
URL: https://github.com/apache/arrow/pull/8853
This PR:
* extends `concat` to support every type that `MutableArrayData` supports
(i.e. it now supports nested Lists, all primitives, boolean, string and
large string, etc.)
* makes `concat` about 6x faster for primitive types and about 2x faster for
string types (and likely also faster for the other types)
* changes `concat`'s signature to `&[&Array]` instead of `&[Arc<Array>]`, to
avoid an `Arc::clone`.
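
The effect of the signature change in the last bullet can be sketched as follows. The `Array` trait, `Int32Array` type and both functions are simplified stand-ins invented for this illustration (they are not the arrow crate's definitions), and the slice type is written `&[&dyn Array]` as current Rust syntax requires. The point is only that a function taking borrowed references lets a caller that holds `Arc`s borrow from them instead of cloning them.

```rust
use std::sync::Arc;

/// Simplified stand-in for an array trait (hypothetical, not arrow's `Array`).
trait Array {
    fn len(&self) -> usize;
}

/// Toy array type used only for this illustration.
struct Int32Array(Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Borrowing signature: callers pass plain references, so no reference
/// count is incremented on the way in.
fn total_len(arrays: &[&dyn Array]) -> usize {
    arrays.iter().map(|a| a.len()).sum()
}

/// A caller that owns `Arc`s borrows from them rather than cloning them.
fn total_len_of_arcs(arcs: &[Arc<dyn Array>]) -> usize {
    let refs: Vec<&dyn Array> = arcs.iter().map(|a| a.as_ref()).collect();
    total_len(&refs)
}
```

With the previous `&[Arc<Array>]` shape, a caller that only had references would have needed to obtain an owned `Arc` for each input first.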
Since `XBuilder::append_data` was specifically built for this kernel but is
not used, and `MutableArrayData` offers a more generic API for it, this PR
removes that code.
The overall principle for this removal is that `Builder` is the API to build
an arrow array from elements or slices of rust native types, while
`MutableArrayData` (for lack of a better name) is suited to building an arrow
array from an existing set of arrow arrays. In the case of `concat`, this
corresponds to mem-copies of the individual arrays (taking into account nulls
and all that stuff) in sequence.
Based on this principle, `Builder` does not need to know how to build an
array from existing arrays (i.e. `append_data`).
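
A minimal sketch of this division of labour, using toy types only: `ToyArray`, `ToyBuilder`, `ToyMutable` and `toy_concat` are hypothetical names made up for this example and do not reproduce the arrow crate's API (for one, validity is kept as a `Vec<bool>` here rather than a packed bitmap). The builder is fed Rust native values one at a time, while the mutable structure copies ranges out of existing arrays, so concatenation reduces to one bulk copy per input.

```rust
/// Toy nullable i32 array: values plus one validity flag per slot.
#[derive(Debug, Clone, PartialEq)]
struct ToyArray {
    values: Vec<i32>,
    validity: Vec<bool>,
}

/// "Builder" role: construct an array from Rust native values.
#[derive(Default)]
struct ToyBuilder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyBuilder {
    /// Append one element; `None` becomes a null slot with a placeholder value.
    fn append_option(&mut self, v: Option<i32>) {
        self.values.push(v.unwrap_or_default());
        self.validity.push(v.is_some());
    }

    fn finish(self) -> ToyArray {
        ToyArray { values: self.values, validity: self.validity }
    }
}

/// "Mutable array data" role: construct an array by copying ranges
/// out of a fixed set of existing arrays.
struct ToyMutable<'a> {
    sources: Vec<&'a ToyArray>,
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl<'a> ToyMutable<'a> {
    fn new(sources: Vec<&'a ToyArray>, capacity: usize) -> Self {
        Self {
            sources,
            values: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    /// Copy slots `start..end` of source number `index` in bulk,
    /// carrying the validity flags along with the values.
    fn extend(&mut self, index: usize, start: usize, end: usize) {
        let src = self.sources[index];
        self.values.extend_from_slice(&src.values[start..end]);
        self.validity.extend_from_slice(&src.validity[start..end]);
    }

    fn freeze(self) -> ToyArray {
        ToyArray { values: self.values, validity: self.validity }
    }
}

/// Concatenation is one bulk copy per input array, in order.
fn toy_concat(arrays: &[&ToyArray]) -> ToyArray {
    let capacity: usize = arrays.iter().map(|a| a.values.len()).sum();
    let mut out = ToyMutable::new(arrays.to_vec(), capacity);
    for (i, a) in arrays.iter().enumerate() {
        out.extend(i, 0, a.values.len());
    }
    out.freeze()
}
```

In this toy model the builder never needs a method that accepts another array, which is the role `append_data` played.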
I would like to migrate all the tests for `XBuilder::append_data` to
`MutableArrayData` so that they are not lost, but for that #8850, #8852, #8851,
#8849 and #8848 need to land first (hence this PR is a draft).
Benchmarks:
| benchmark | variation (%) |
|-------------- | -------------- |
| concat str 1024 | -45.3 |
| concat str nulls 1024 | -61.1 |
| concat i32 1024 | -83.5 |
| concat i32 nulls 1024 | -86.1 |
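
To relate the percentages in the table to the "6x" and "2x" figures, a change in run time can be converted to a speedup factor as old time divided by new time, i.e. `1 / (1 + change / 100)`. The helper names below are made up for this sketch; the input numbers are the four rows of the table.

```rust
/// Convert a relative change in run time (in percent, negative = faster)
/// into a speedup factor: old_time / new_time = 1 / (1 + change / 100).
fn speedup(change_pct: f64) -> f64 {
    1.0 / (1.0 + change_pct / 100.0)
}

/// The four rows of the table, each paired with its speedup factor.
fn table_speedups() -> Vec<(&'static str, f64)> {
    let rows = [
        ("concat str 1024", -45.3),
        ("concat str nulls 1024", -61.1),
        ("concat i32 1024", -83.5),
        ("concat i32 nulls 1024", -86.1),
    ];
    rows.iter().map(|&(name, change)| (name, speedup(change))).collect()
}
```

This gives roughly 1.8x and 2.6x for the string benchmarks and roughly 6.1x and 7.2x for the `i32` benchmarks.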
```
git checkout 66468daf0b3ac3ef08b7c99c690e7b845f23ad2b
cargo bench --bench concatenate_kernel
git checkout concat
cargo bench --bench concatenate_kernel
```
```
Previous HEAD position was 66468daf0 Added concatenate bench
Switched to branch 'concat'
Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
Finished bench [optimized] target(s) in 58.72s
Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/concatenate_kernel-94b8f5621cd4f767
Gnuplot not found, using plotters backend
concat i32 1024 time: [4.2852 us 4.2912 us 4.2973 us]
change: [-83.690% -83.469% -83.188%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low severe
4 (4.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
concat i32 nulls 1024 time: [4.8617 us 4.8820 us 4.9080 us]
change: [-86.335% -86.101% -85.813%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
concat str 1024 time: [19.472 us 19.527 us 19.593 us]
change: [-46.212% -45.314% -44.341%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
concat str nulls 1024 time: [39.447 us 39.525 us 39.613 us]
change: [-61.858% -61.091% -60.311%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) low mild
5 (5.00%) high mild
5 (5.00%) high severe
```