jorgecarleitao opened a new pull request #8630:
URL: https://github.com/apache/arrow/pull/8630
## Motivation

The current code-base to `filter` an array is typed, which means that we cannot easily generalize it to arbitrary depths. That code is similar to the code in `take`: both operations perform essentially the same task: from an array, produce another array by taking elements from an existing array.

## This PR

This is a proposal to introduce a new `struct` (`MutableArrayData`, for lack of a better name) that is useful to efficiently construct an array taken from another array. The main use-case of this `struct` is unary operations that are type-independent, such as `filter`, `take` and `sort`. With this `struct`, this PR:

* Adds `MutableArrayData` to efficiently create `ArrayData` as partial `memcopies` from another `ArrayData`
* Adds support for filtering `List`s of arbitrary depth and type
* Speeds up most filtering operations in the expensive cases (high selectivity => many memcopies), some by a factor of 3x, and slows down the cheap cases (low selectivity => few memcopies) or filtering without preparation (previously `FilterContext`) by a factor of 2x
* Improves the benchmark to use random numbers to avoid spurious correlations (e.g. the null mask and the filter mask were 100% correlated in one of the cases). Also adds a benchmark for strings.
* Achieves all of this in only +170 LOC, even after all the new tests and benchmarks

The gist of this `struct` can be seen in the example reproduced below:

```rust
use std::sync::Arc;
use arrow::array::{Array, Int32Array, MutableArrayData};

let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data();

// Create a new `MutableArrayData` from an array and with a given capacity.
let mut mutable = MutableArrayData::new(&array, 4);
mutable.extend(1, 3); // extend from the slice [1..3], i.e. [2, 3]
mutable.extend(0, 3); // extend from the slice [0..3], i.e. [1, 2, 3]

// `.freeze()` converts the `MutableArrayData` into an `ArrayData`.
let new_array = Int32Array::from(Arc::new(mutable.freeze()));
assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array);
```

## Design

There are 3 core ideas in this design:

1. It operates on `ArrayData`, i.e. it is (Rust-)type-independent and can thus be used in a variety of use-cases.
2. It assumes that it is always advantageous to memcopy one slice of size 2N rather than 2 slices of size N.
3. It assumes that it is advantageous to pre-allocate memory (to the extent possible).

With this in mind, the API has 3 methods: `new`, `extend` and `freeze`. The code in `filter` was already reducing memcopies to some extent, but this PR expands further on that idea by copying the largest possible contiguous chunk, instead of at most 64 slots.

## Benchmarks

<details>
<summary>Benchmarks</summary>

```bash
set -e
git checkout 02f8de0513f663364b4bded6612a25ee30ab5286
cargo bench --bench filter_kernels
git checkout mutable_filter
cargo bench --bench filter_kernels
```

```bash
Previous HEAD position was 02f8de051 Improved filter bench.
Switched to branch 'mutable_filter'
Your branch is up to date with 'origin/mutable_filter'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 49.79s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-9f6c18b388f93061
Gnuplot not found, using plotters backend
filter u8 low selectivity
                        time:   [493.39 us 494.31 us 495.64 us]
                        change: [+34.336% +35.051% +35.925%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high severe
filter u8 high selectivity
                        time:   [20.906 us 20.961 us 21.028 us]
                        change: [+18.384% +19.687% +21.168%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe
filter u8 very low selectivity
                        time:   [16.364 us 16.403 us 16.444 us]
                        change: [+105.77% +107.52% +110.04%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
filter context u8 low selectivity
                        time:   [237.92 us 238.17 us 238.46 us]
                        change: [-33.849% -33.551% -33.168%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
filter context u8 high selectivity
                        time:   [3.8479 us 3.8552 us 3.8639 us]
                        change: [-67.849% -67.572% -67.262%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking filter context u8 very low selectivity: Collecting 100 samples in estimated 5.0053 s (4.6M iteration
filter context u8 very low selectivity
                        time:   [1.0968 us 1.1008 us 1.1051 us]
                        change: [-56.118% -55.701% -55.194%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking filter context u8 w NULLs low selectivity: Collecting 100 samples in estimated 5.7185 s (10k iterati
filter context u8 w NULLs low selectivity
                        time:   [557.84 us 558.80 us 559.97 us]
                        change: [+3.2555% +3.9061% +4.4997%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
Benchmarking filter context u8 w NULLs high selectivity: Collecting 100 samples in estimated 5.9583 s (20k iterat
filter context u8 w NULLs high selectivity
                        time:   [294.03 us 295.12 us 296.23 us]
                        change: [-20.326% -19.909% -19.464%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking filter context u8 w NULLs very low selectivity: Collecting 100 samples in estimated 5.0056 s (3.4M i
filter context u8 w NULLs very low selectivity
                        time:   [1.4384 us 1.4422 us 1.4462 us]
                        change: [-56.333% -56.014% -55.694%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
filter context f32 low selectivity
                        time:   [570.34 us 571.26 us 572.37 us]
                        change: [+2.0407% +3.0080% +4.1985%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
filter context f32 high selectivity
                        time:   [303.00 us 303.76 us 304.57 us]
                        change: [-20.733% -20.142% -19.444%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking filter context f32 very low selectivity: Collecting 100 samples in estimated 5.0063 s (4.0M iteratio
filter context f32 very low selectivity
                        time:   [1.2649 us 1.2751 us 1.2887 us]
                        change: [-64.759% -64.241% -63.628%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
filter context string low selectivity
                        time:   [774.88 us 775.66 us 776.49 us]
                        change: [-31.240% -30.766% -30.270%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
Benchmarking filter context string high selectivity: Collecting 100 samples in estimated 9.9016 s (10k iterations
filter context string high selectivity
                        time:   [952.06 us 953.63 us 955.55 us]
                        change: [-50.955% -50.495% -49.945%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
Benchmarking filter context string very low selectivity: Collecting 100 samples in estimated 5.0151 s (677k itera
filter context string very low selectivity
                        time:   [7.2890 us 7.3009 us 7.3150 us]
                        change: [+75.618% +77.102% +78.817%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```

</details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
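As an aside, the "copy the largest possible chunk" idea described under Design can be sketched independently of Arrow. The sketch below is illustrative only: `filter_runs` is a hypothetical helper, not part of this PR's API. It turns a boolean filter mask into maximal contiguous `(start, end)` runs of selected indices, so that each run could be handed to a single `extend(start, end)` call (one memcopy per run, rather than one per selected slot or per 64-slot window):

```rust
/// Compute maximal contiguous runs of `true` values in a filter mask.
/// Returns half-open ranges `(start, end)`; each range corresponds to
/// one contiguous memcopy. (Hypothetical helper for illustration.)
fn filter_runs(mask: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &selected) in mask.iter().enumerate() {
        match (selected, start) {
            (true, None) => start = Some(i), // a run begins at i
            (false, Some(s)) => {
                runs.push((s, i)); // the run [s, i) ends
                start = None;
            }
            _ => {} // still inside a run, or still outside one
        }
    }
    if let Some(s) = start {
        runs.push((s, mask.len())); // the mask ended inside a run
    }
    runs
}

fn main() {
    // Indices 1..4 and 5..7 are selected: two runs, hence two memcopies
    // instead of five element-wise copies.
    let mask = [false, true, true, true, false, true, true];
    assert_eq!(filter_runs(&mask), vec![(1, 4), (5, 7)]);
}
```

The higher the selectivity of the filter, the longer these runs tend to be, which is exactly the regime where this PR reports the 3x speed-ups.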