jorgecarleitao opened a new pull request #8670:
URL: https://github.com/apache/arrow/pull/8670
TL;DR:
* Add support for filtering `StructArray`
* Fix 2 bugs in `take` of `StructArray` and `ListArray`
* Speed up `take` by 1.2-1.9x
## Motivation
Same motivation as #8630 plus solving the following issues:
* ARROW-10591
* ARROW-10592
* ARROW-10593
* ARROW-10594
They are all interconnected, which made it difficult to solve one without
solving the others (the tests either pass or they don't).
## This PR
This PR extends `MutableDataArray` so that it can be extended with nulls,
which makes it possible to implement `take` and `filter` correctly for
`StructArray`. Specifically, currently:
* `take` for structs with nulls is incorrect (ARROW-10593)
* `filter` does not support `StructArray` (ARROW-10591)
This PR fixes both issues.
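To illustrate the idea (not the actual code in this PR), here is a minimal, dependency-free sketch of a growable builder that can either copy a range from a source or append nulls, and of how `take` and `filter` can both be written on top of it. `ToyMutable`, `toy_take`, and `toy_filter` are hypothetical names made up for this sketch; they are not part of the arrow crate, and the real `MutableDataArray` works on Arrow buffers rather than on `Vec<Option<i32>>`.

```rust
/// Toy nullable column: `None` marks a null slot.
/// Hypothetical stand-in for an Arrow array; not the arrow crate's API.
type Column = Vec<Option<i32>>;

/// Toy growable builder that copies ranges from a source column,
/// or appends nulls without reading the source at all.
struct ToyMutable<'a> {
    source: &'a [Option<i32>],
    out: Column,
}

impl<'a> ToyMutable<'a> {
    fn new(source: &'a [Option<i32>], capacity: usize) -> Self {
        ToyMutable { source, out: Vec::with_capacity(capacity) }
    }

    /// Append `source[start..end]`, nulls included.
    fn extend(&mut self, start: usize, end: usize) {
        self.out.extend_from_slice(&self.source[start..end]);
    }

    /// Append `n` null slots; the source is never touched.
    fn extend_nulls(&mut self, n: usize) {
        self.out.extend(std::iter::repeat(None).take(n));
    }

    fn freeze(self) -> Column {
        self.out
    }
}

/// `take`: a null index produces a null slot in the output.
fn toy_take(source: &[Option<i32>], indices: &[Option<usize>]) -> Column {
    let mut m = ToyMutable::new(source, indices.len());
    for idx in indices {
        match idx {
            Some(i) => m.extend(*i, *i + 1),
            None => m.extend_nulls(1),
        }
    }
    m.freeze()
}

/// `filter`: keep the slots whose mask entry is `true`.
/// Assumes `mask` is no longer than `source`.
fn toy_filter(source: &[Option<i32>], mask: &[bool]) -> Column {
    let mut m = ToyMutable::new(source, source.len());
    for (i, keep) in mask.iter().enumerate() {
        if *keep {
            m.extend(i, i + 1);
        }
    }
    m.freeze()
}
```

With a source of `[Some(1), None, Some(3)]`, taking the indices `[Some(2), None, Some(1), Some(0)]` goes through `extend_nulls` once (for the null index) and through `extend` three times, one of which copies an already-null slot.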
This PR also converts the implementation of `take` to use
`MutableDataArray`, thereby reducing the chance of bugs and the maintenance
burden, and improving performance. While doing that, I found 1 bug and 1
possible improvement, both of which this PR addresses:
* currently, `take` has a bug in which nulls are not correctly taken into
account (ARROW-10593)
* currently, `take` takes all values from a struct array, even when we are
just taking a null (ARROW-10594)
This PR fixes both issues.
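As a rough sketch of the intended semantics (my own toy model, not the code in this PR), `take` on a struct array should emit a null row, without reading any child values, whenever the index is null or the selected struct slot is null. `ToyStruct` and `toy_take_struct` are hypothetical names, and the two `i32` child columns are an assumption made only to keep the example short.

```rust
/// Toy struct array: one validity flag per row plus two child columns.
/// Hypothetical model; the real `StructArray` stores a validity bitmap
/// and arbitrary child arrays.
#[derive(Debug, PartialEq, Clone)]
struct ToyStruct {
    validity: Vec<bool>,
    a: Vec<Option<i32>>,
    b: Vec<Option<i32>>,
}

/// `take` over the toy struct array with nullable indices.
fn toy_take_struct(src: &ToyStruct, indices: &[Option<usize>]) -> ToyStruct {
    let n = indices.len();
    let mut out = ToyStruct {
        validity: Vec::with_capacity(n),
        a: Vec::with_capacity(n),
        b: Vec::with_capacity(n),
    };
    for idx in indices {
        match idx {
            // Valid index pointing at a valid struct slot: copy the children.
            Some(i) if src.validity[*i] => {
                out.validity.push(true);
                out.a.push(src.a[*i]);
                out.b.push(src.b[*i]);
            }
            // Null index or null struct slot: emit a null row and
            // do not read the child values at all.
            _ => {
                out.validity.push(false);
                out.a.push(None);
                out.b.push(None);
            }
        }
    }
    out
}
```

In this model, whatever values happen to sit in the children underneath a null struct slot never reach the output, which is the behavior the two bullets above are about.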
Finally, this improves the performance of `take` by 1.2-1.9x (which also
impacts the performance of `sort`).
<details>
<summary>Benchmarks</summary>
```bash
set -e
git checkout 010d260173a9e110901ca67372a4ac379a615a13
cargo bench --bench filter_kernels
git checkout mutable_filter
cargo bench --bench filter_kernels
```
```bash
Previous HEAD position was e54bd6272 Migrated filter.
Switched to branch 'clean_take'
Your branch is up to date with 'origin/clean_take'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 45.00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/take_kernels-cd6b83d872e2c3bf
Gnuplot not found, using plotters backend
take i32 512            time:   [6.1191 us 6.1260 us 6.1333 us]
                        change: [-20.128% -19.695% -19.203%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

take i32 1024           time:   [10.691 us 10.713 us 10.737 us]
                        change: [-23.008% -22.493% -21.981%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

take bool 512           time:   [5.1620 us 5.1880 us 5.2168 us]
                        change: [-38.963% -38.374% -37.725%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

take bool 1024          time:   [8.5777 us 8.5896 us 8.6024 us]
                        change: [-45.607% -45.364% -45.092%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

take str 512            time:   [13.796 us 13.811 us 13.826 us]
                        change: [-24.722% -24.254% -23.764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

take str 1024           time:   [25.149 us 25.167 us 25.186 us]
                        change: [-26.483% -26.051% -25.609%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
```
</details>