jorgecarleitao opened a new pull request #8630:
URL: https://github.com/apache/arrow/pull/8630


   ## Motivation
   
   The current code to `filter` an array is typed, which means that it cannot easily be generalized to arbitrarily nested types.
   
   That code is also very similar to the code in `take`: both operations produce a new array by copying selected elements from an existing array.
   
   ## This PR
   
   This is a proposal to introduce a new `struct` (`MutableArrayData`, for lack of a better name) to efficiently construct an array from pieces of another array. Its main use-case is type-independent unary operations such as `filter`, `take` and `sort`.
   
   With this `struct`, this PR:
   
   * Adds `MutableArrayData` to efficiently create an `ArrayData` from partial memcopies of another `ArrayData`
   * Adds support for filtering `List`s of arbitrary depth and types (see the sketch after the example below)
   * Speeds up most filtering operations in the expensive cases (high selectivity => many memcopies), some by a factor of 3x, and slows them down by about 2x in the cheap cases (low selectivity => few memcopies) or when filtering without filter preparation (previously `FilterContext`)
   * Improves the benchmarks to use random numbers, avoiding spurious correlations (e.g. the null mask and the filter mask were 100% correlated in one of the cases), and adds a benchmark for strings
   * Only +170 LOC in total, even with all the new tests and benchmarks
   
   The gist of this `struct` can be seen in the example reproduced below:
   
   ```rust
   use std::sync::Arc;
   use arrow::array::{Array, Int32Array, MutableArrayData};
   
   let array = Int32Array::from(vec![1, 2, 3, 4, 5]).data();
   
   // Create a new `MutableArrayData` from an array, with a given capacity.
   let mut mutable = MutableArrayData::new(&array, 4);
   mutable.extend(1, 3); // extend from the slice [1..3], [2,3]
   mutable.extend(0, 3); // extend from the slice [0..3], [1,2,3]
   
   // `.freeze()` converts the `MutableArrayData` into an `ArrayData`.
   let new_array = Int32Array::from(Arc::new(mutable.freeze()));
   
   assert_eq!(Int32Array::from(vec![2, 3, 1, 2, 3]), new_array);
   ```
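   
   With this PR, nested lists go through the same kernel. Below is a minimal, illustrative sketch (not code from this PR) of filtering a `List<Int32>` array; it assumes the existing `ListBuilder`/`Int32Builder` builder APIs and the `filter` kernel entry point:
   
   ```rust
   use arrow::array::{Array, BooleanArray, Int32Builder, ListBuilder};
   use arrow::compute::kernels::filter::filter;
   use arrow::error::Result;
   
   fn main() -> Result<()> {
       // Build a List<Int32> array: [[1, 2], [3], [4, 5, 6]]
       let mut builder = ListBuilder::new(Int32Builder::new(6));
       builder.values().append_value(1)?;
       builder.values().append_value(2)?;
       builder.append(true)?;
       builder.values().append_value(3)?;
       builder.append(true)?;
       builder.values().append_value(4)?;
       builder.values().append_value(5)?;
       builder.values().append_value(6)?;
       builder.append(true)?;
       let list = builder.finish();
   
       // Keep the first and last entries: [[1, 2], [4, 5, 6]]
       let mask = BooleanArray::from(vec![true, false, true]);
       let filtered = filter(&list, &mask)?;
   
       assert_eq!(filtered.len(), 2);
       Ok(())
   }
   ```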
   
   ## Design
   
   There are 3 core ideas in this design:
   
   1. it operates on `ArrayData`, i.e. it is (Rust-)type-independent and can thus be used across a variety of use-cases.
   2. it assumes that it is always advantageous to memcopy one slice of size 2*N rather than 2 slices of size N.
   3. it assumes that it is advantageous to pre-allocate memory (to the extent possible).
   
   With this in mind, the API has 3 methods: `new`, `extend` and `freeze`. The existing code in `filter` already reduced memcopies to some extent, but this PR takes the idea further by copying the largest possible contiguous chunk instead of at most 64 slots at a time.
   
   ## Benchmarks
   
   <details>
    <summary>Benchmarks</summary>
   
   ```bash
   set -e
   git checkout 02f8de0513f663364b4bded6612a25ee30ab5286
   cargo bench --bench filter_kernels
   git checkout mutable_filter
   cargo bench --bench filter_kernels
   ```
   
   ```bash
   Previous HEAD position was 02f8de051 Improved filter bench.
   Switched to branch 'mutable_filter'
   Your branch is up to date with 'origin/mutable_filter'.
      Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
       Finished bench [optimized] target(s) in 49.79s
        Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/filter_kernels-9f6c18b388f93061
   Gnuplot not found, using plotters backend
   filter u8 low selectivity
                           time:   [493.39 us 494.31 us 495.64 us]
                           change: [+34.336% +35.051% +35.925%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 6 outliers among 100 measurements (6.00%)
     6 (6.00%) high severe
   
   filter u8 high selectivity
                           time:   [20.906 us 20.961 us 21.028 us]
                           change: [+18.384% +19.687% +21.168%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 13 outliers among 100 measurements (13.00%)
     6 (6.00%) high mild
     7 (7.00%) high severe
   
   filter u8 very low selectivity
                           time:   [16.364 us 16.403 us 16.444 us]
                           change: [+105.77% +107.52% +110.04%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 9 outliers among 100 measurements (9.00%)
     4 (4.00%) high mild
     5 (5.00%) high severe
   
   filter context u8 low selectivity
                           time:   [237.92 us 238.17 us 238.46 us]
                           change: [-33.849% -33.551% -33.168%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) high mild
     4 (4.00%) high severe
   
   filter context u8 high selectivity
                           time:   [3.8479 us 3.8552 us 3.8639 us]
                           change: [-67.849% -67.572% -67.262%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     2 (2.00%) high mild
     5 (5.00%) high severe
   
   filter context u8 very low selectivity
                           time:   [1.0968 us 1.1008 us 1.1051 us]
                           change: [-56.118% -55.701% -55.194%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   filter context u8 w NULLs low selectivity
                           time:   [557.84 us 558.80 us 559.97 us]
                           change: [+3.2555% +3.9061% +4.4997%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 12 outliers among 100 measurements (12.00%)
     4 (4.00%) high mild
     8 (8.00%) high severe
   
   filter context u8 w NULLs high selectivity
                           time:   [294.03 us 295.12 us 296.23 us]
                           change: [-20.326% -19.909% -19.464%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   filter context u8 w NULLs very low selectivity
                           time:   [1.4384 us 1.4422 us 1.4462 us]
                           change: [-56.333% -56.014% -55.694%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   filter context f32 low selectivity
                           time:   [570.34 us 571.26 us 572.37 us]
                           change: [+2.0407% +3.0080% +4.1985%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 12 outliers among 100 measurements (12.00%)
     3 (3.00%) high mild
     9 (9.00%) high severe
   
   filter context f32 high selectivity
                           time:   [303.00 us 303.76 us 304.57 us]
                           change: [-20.733% -20.142% -19.444%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high severe
   
   filter context f32 very low selectivity
                           time:   [1.2649 us 1.2751 us 1.2887 us]
                           change: [-64.759% -64.241% -63.628%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     3 (3.00%) high mild
     7 (7.00%) high severe
   
   filter context string low selectivity
                           time:   [774.88 us 775.66 us 776.49 us]
                           change: [-31.240% -30.766% -30.270%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   filter context string high selectivity
                           time:   [952.06 us 953.63 us 955.55 us]
                           change: [-50.955% -50.495% -49.945%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) high mild
     5 (5.00%) high severe
   
   filter context string very low selectivity
                           time:   [7.2890 us 7.3009 us 7.3150 us]
                           change: [+75.618% +77.102% +78.817%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   ```
   </details>

