jorgecarleitao opened a new pull request #8670:
URL: https://github.com/apache/arrow/pull/8670


   TL;DR:
   
   * Add support to filter `StructArray`
   * fixes 2 bugs in `take` of `StructArray` and `ListArray`
   * speedup `take` by 1.2-1.9
   
   ## Motivation
   
   Same motivation as #8630 plus solving the following issues:
   
   * ARROW-10591
   * ARROW-10592
   * ARROW-10593
   * ARROW-10594
   
   They are all inter-connected, which made it difficult to solve one without 
solving the other (as the tests either pass or don't).
   
   ## This PR
   
   This PR extends `MutableDataArray` to support extending the 
`MutableDataArray` with nulls, thereby allowing to correctly implement `take` 
and `filter` for `StructArray`. Specifically, currently,
   
   * `take` for structs with nulls is incorrect (ARROW-10593)
   * `filter` does not support `StructArray` (ARROW-10591)
   
   This PR fixes both issues.
   
   This PR also converts the implementation of `take` to use 
`MutableDataArray`, thereby reducing the chance of bugs and maintenance burden, 
as well as performance. While doing that, I found 1 bug and 1 improvement, that 
this PR addresses:
   
   * currently, `take` has a bug on which nulls are not correctly taken into 
account (ARROW-10593) 
   * currently `take` takes all values from a struct array, even when we are 
just taking a null  (ARROW-10594)
   
   This PR fixes both issues.
   
   Finally, this improves performance of `take` by 1.2-1.9x (which also impacts 
the performance of the `sort`).
   
   <details>
    <summary>Benchmarks</summary>
   
   ```bash
   set -e
   git checkout 010d260173a9e110901ca67372a4ac379a615a13
   cargo bench --bench filter_kernels
   git checkout mutable_filter
   cargo bench --bench filter_kernels
   ```
   
   ```bash
   Previous HEAD position was e54bd6272 Migrated filter.
   Switched to branch 'clean_take'
   Your branch is up to date with 'origin/clean_take'.
      Compiling arrow v3.0.0-SNAPSHOT 
(/Users/jorgecarleitao/projects/arrow/rust/arrow)
       Finished bench [optimized] target(s) in 45.00s
        Running 
/Users/jorgecarleitao/projects/arrow/rust/target/release/deps/take_kernels-cd6b83d872e2c3bf
   Gnuplot not found, using plotters backend
   take i32 512            time:   [6.1191 us 6.1260 us 6.1333 us]              
            
                           change: [-20.128% -19.695% -19.203%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   take i32 1024           time:   [10.691 us 10.713 us 10.737 us]              
             
                           change: [-23.008% -22.493% -21.981%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     5 (5.00%) high mild
     5 (5.00%) high severe
   
   take bool 512           time:   [5.1620 us 5.1880 us 5.2168 us]              
             
                           change: [-38.963% -38.374% -37.725%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   take bool 1024          time:   [8.5777 us 8.5896 us 8.6024 us]              
              
                           change: [-45.607% -45.364% -45.092%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     3 (3.00%) high mild
     5 (5.00%) high severe
   
   take str 512            time:   [13.796 us 13.811 us 13.826 us]              
            
                           change: [-24.722% -24.254% -23.764%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) high mild
     4 (4.00%) high severe
   
   take str 1024           time:   [25.149 us 25.167 us 25.186 us]              
             
                           change: [-26.483% -26.051% -25.609%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   ```
   </details>
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to