jorgecarleitao opened a new pull request #8630:
URL: https://github.com/apache/arrow/pull/8630
## Motivation
The current code-base to `filter` an array is typed, which means that we
can't easily generalize it to arbitrary depths.
That code is also similar to the code in `take`: both operations produce a
new array by selecting elements from an existing array.
## This PR
This is a proposal to introduce a new `struct` (`MutableArrayData`, for
lack of a better name) to efficiently construct an array taken
from another array. The main use-case of this `struct` is unary
operations that are type-independent, such as `filter`, `take` and `sort`.
With this `struct`, this PR:
* Adds `MutableArrayData` to efficiently create `ArrayData` as partial
`memcopies` from another `ArrayData`
* Adds support for filtering `List`s of arbitrary depth and type
* Speeds up most filtering operations in the expensive cases (high
selectivity => many memcopies), some by a factor of 3x, and slows down
the cheap cases (low selectivity => few memcopies) by a factor of up to 2x,
with or without filter preparation (previously `FilterContext`)
* Improves the benchmark to use random numbers to avoid spurious
correlations (e.g. the null and filter masks were 100% correlated in one of
the cases). Also adds a benchmark for strings.
* (only +170 LOC, even after all the new tests and benchmarks)
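To avoid the spurious correlations mentioned above, the filter mask and the null bitmap must be drawn independently. A minimal, dependency-free sketch of how such masks could be generated (hypothetical code using a simple LCG, not the benchmark's actual `rand`-based code):

```rust
/// A tiny linear congruential generator (Knuth's MMIX constants),
/// used here only to keep the sketch dependency-free.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

/// Build a boolean mask of length `len` where each slot is selected
/// independently with probability `selectivity`.
fn random_mask(len: usize, selectivity: f64, seed: u64) -> Vec<bool> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            // take the high 53 bits to form a uniform f64 in [0, 1)
            let r = (lcg(&mut state) >> 11) as f64 / (1u64 << 53) as f64;
            r < selectivity
        })
        .collect()
}

fn main() {
    // different seeds => filter mask and null bitmap are uncorrelated
    let mask = random_mask(1024, 0.5, 42);
    let nulls = random_mask(1024, 0.5, 7);
    assert_eq!(mask.len(), 1024);
    assert_ne!(mask, nulls);
}
```
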
The gist of this `struct` can be seen in the example reproduced below:
```rust
use std::sync::Arc;
use arrow::array::{Array, Int32Array, MutableArrayData};

let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data();
// Create a new `MutableArrayData` from an array, with a given capacity.
let mut mutable = MutableArrayData::new(&array, 4);
mutable.extend(1, 3); // memcopy slots [1..3) of `array`, i.e. values [2, 3]
mutable.extend(0, 3); // memcopy slots [0..3) of `array`, i.e. values [1, 2, 3]
// `.freeze()` converts the `MutableArrayData` into an (immutable) `ArrayData`.
let new_array = Int32Array::from(Arc::new(mutable.freeze()));
assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array);
```
## Design
There are 3 core ideas in this design:
1. it operates at the `ArrayData` level, i.e. it is (rust-)type-independent
and can thus be used in a variety of use-cases.
2. it assumes that it is always advantageous to memcopy one slice of size
2*N rather than 2 slices of size N.
3. it assumes that it is advantageous to pre-allocate memory (to the extent
possible).
With this in mind, the API has 3 methods: `new`, `extend` and `freeze`. The
previous code in `filter` already reduced memcopies to some extent, but this
PR expands on that idea by copying the largest possible contiguous chunk,
instead of at most 64 slots at a time.
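Idea 2 is what drives the `extend` API: a filter can first merge the selected slots into maximal contiguous runs and then issue one `extend` (i.e. one memcopy) per run. A self-contained sketch of that run-merging step (hypothetical illustration, not the PR's actual implementation):

```rust
/// Merge a boolean filter mask into maximal contiguous `[start, end)` runs,
/// so that each run can be copied with a single `extend` (one memcopy).
fn mask_to_runs(mask: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &selected) in mask.iter().enumerate() {
        match (start, selected) {
            (None, true) => start = Some(i), // a new run begins at slot i
            (Some(s), false) => {
                runs.push((s, i)); // the current run ends before slot i
                start = None;
            }
            _ => {} // run continues, or we are still outside a run
        }
    }
    if let Some(s) = start {
        runs.push((s, mask.len())); // close a run that reaches the end
    }
    runs
}

fn main() {
    // selects slots 0..2 and 3..6: two memcopies instead of five
    // single-slot copies.
    let runs = mask_to_runs(&[true, true, false, true, true, true]);
    assert_eq!(runs, vec![(0, 2), (3, 6)]);
}
```

Under this scheme, high selectivity yields long runs and therefore few, large memcopies, which is where the PR's speed-ups come from.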
## Benchmarks
<details>
<summary>Benchmarks</summary>
```bash
set -e
git checkout 02f8de0513f663364b4bded6612a25ee30ab5286
cargo bench --bench filter_kernels
git checkout mutable_filter
cargo bench --bench filter_kernels
```
```bash
Previous HEAD position was 02f8de051 Improved filter bench.
Switched to branch 'mutable_filter'
Your branch is up to date with 'origin/mutable_filter'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 49.79s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-9f6c18b388f93061
Gnuplot not found, using plotters backend
filter u8 low selectivity
                        time:   [493.39 us 494.31 us 495.64 us]
                        change: [+34.336% +35.051% +35.925%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high severe

filter u8 high selectivity
                        time:   [20.906 us 20.961 us 21.028 us]
                        change: [+18.384% +19.687% +21.168%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

filter u8 very low selectivity
                        time:   [16.364 us 16.403 us 16.444 us]
                        change: [+105.77% +107.52% +110.04%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

filter context u8 low selectivity
                        time:   [237.92 us 238.17 us 238.46 us]
                        change: [-33.849% -33.551% -33.168%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

filter context u8 high selectivity
                        time:   [3.8479 us 3.8552 us 3.8639 us]
                        change: [-67.849% -67.572% -67.262%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

filter context u8 very low selectivity
                        time:   [1.0968 us 1.1008 us 1.1051 us]
                        change: [-56.118% -55.701% -55.194%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs low selectivity
                        time:   [557.84 us 558.80 us 559.97 us]
                        change: [+3.2555% +3.9061% +4.4997%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

filter context u8 w NULLs high selectivity
                        time:   [294.03 us 295.12 us 296.23 us]
                        change: [-20.326% -19.909% -19.464%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

filter context u8 w NULLs very low selectivity
                        time:   [1.4384 us 1.4422 us 1.4462 us]
                        change: [-56.333% -56.014% -55.694%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

filter context f32 low selectivity
                        time:   [570.34 us 571.26 us 572.37 us]
                        change: [+2.0407% +3.0080% +4.1985%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

filter context f32 high selectivity
                        time:   [303.00 us 303.76 us 304.57 us]
                        change: [-20.733% -20.142% -19.444%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

filter context f32 very low selectivity
                        time:   [1.2649 us 1.2751 us 1.2887 us]
                        change: [-64.759% -64.241% -63.628%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

filter context string low selectivity
                        time:   [774.88 us 775.66 us 776.49 us]
                        change: [-31.240% -30.766% -30.270%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

filter context string high selectivity
                        time:   [952.06 us 953.63 us 955.55 us]
                        change: [-50.955% -50.495% -49.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

filter context string very low selectivity
                        time:   [7.2890 us 7.3009 us 7.3150 us]
                        change: [+75.618% +77.102% +78.817%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```
</details>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]