e-dard opened a new pull request #691:
URL: https://github.com/apache/arrow-datafusion/pull/691


   This PR is part of work I am doing in support of #655. I expect to land some further improvements to this operator, so I'm not sure whether you want to close out #655 with this PR (and I can open a new ticket for each follow-up) or just keep #655 open.
   
   ## Rationale
   
   I work on a project (InfluxDB IOx) that executes an in-memory compaction of collections of record batches using DataFusion. It is very common for input record batches to overlap with each other with respect to some sort key, and we are seeing the `SortPreservingMergeExec` operator eat up a lot of CPU time.
   
   One of the places where it's clear we can improve performance is when materialising output record batches. The current logic of the operator uses the `MutableArray` API to copy rows from the operator's inputs over to the merged output. The [extend()](https://github.com/apache/arrow-rs/blob/master/arrow/src/array/transform/mod.rs#L600-L609) call is currently made for every single input row, even though the API supports batched copies where you specify a start and end row index.
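   A minimal, self-contained sketch of the idea (not the code in this PR), assuming the arrow-rs `MutableArrayData` API with `extend(input_index, start, end)` and `freeze()`; exact signatures may differ between arrow versions. It coalesces consecutive output rows taken from the same input into a single ranged `extend` call:
   
   ```rust
   use arrow::array::{make_array, Array, Int64Array, MutableArrayData};
   
   fn main() {
       // Two already-sorted inputs. Assume the merge loop has decided, row by row,
       // which input each output row comes from: (input index, row index) in output order.
       let a = Int64Array::from(vec![1, 2, 3, 10, 11]);
       let b = Int64Array::from(vec![4, 5, 6, 7]);
       let picks: Vec<(usize, usize)> = vec![
           (0, 0), (0, 1), (0, 2),         // 1, 2, 3 -> one extend of a[0..3]
           (1, 0), (1, 1), (1, 2), (1, 3), // 4..=7   -> one extend of b[0..4]
           (0, 3), (0, 4),                 // 10, 11  -> one extend of a[3..5]
       ];
   
       let mut mutable = MutableArrayData::new(vec![a.data(), b.data()], false, picks.len());
   
       // Coalesce consecutive picks from the same input into one ranged extend,
       // instead of calling extend once per output row.
       let mut run_start = 0;
       for i in 1..=picks.len() {
           let run_ends = i == picks.len()
               || picks[i].0 != picks[run_start].0
               || picks[i].1 != picks[i - 1].1 + 1;
           if run_ends {
               let (input, first_row) = picks[run_start];
               mutable.extend(input, first_row, picks[i - 1].1 + 1);
               run_start = i;
           }
       }
   
       let merged = make_array(mutable.freeze());
       let merged = merged.as_any().downcast_ref::<Int64Array>().unwrap();
       assert_eq!(merged.values().to_vec(), vec![1, 2, 3, 4, 5, 6, 7, 10, 11]);
   }
   ```
   
   With inputs that barely interleave, almost every row still ends up as its own `extend` call; with long runs from one input, the number of calls collapses to the number of runs.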
   
   This PR:
   
    - c352fa3 Adds benchmarks covering the execution of the `SortPreservingMergeExec` operator under several scenarios;
    - d7f76b3 Implements logic to coalesce calls to `MutableArray.extend` over all contiguous rows from the same input;
    - 145a5b8 Increases test coverage of the coalescing logic.
   
   ## Benchmarks
   
   I have included benchmarks that evaluate performance for two input record batches in a number of scenarios (a sketch of what a pair of such inputs might look like follows the list):
   
    - `interleave`: both record batches contain the same data, so the merge pushes one row at a time from each input to the output. This is the scenario where this PR will have the **least** improvement because there isn't much to coalesce.
    - `merge_batches_some_overlap`: the record batches contain similar data, but the rows must be merged such that, for `n` output rows in a batch, roughly the first `1/3` come from input A in one contiguous run, the middle `1/3` are effectively the same in both inputs (resulting in many calls to `extend`), and the final `1/3` come from input B.
    - `merge_batches_no_overlap`: both record batches contain data for the same sort keys, but for each unique sort key none of the rows overlap. That is, for a given sort key all rows from input A are sent to the output, followed by all rows from input B.
    - `merge_batches_small_into_large`: the two inputs share similar data but one input is significantly smaller than the other. This is a particularly realistic scenario for our use-case, where we compact smaller record batches into larger ones.
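   To make the input shapes concrete, here is a hypothetical illustration (not the benchmark code in this PR) of what the `interleave` and `merge_batches_no_overlap` inputs might look like for a made-up `(key, t)` sort key:
   
   ```rust
   use std::sync::Arc;
   use arrow::array::{ArrayRef, Int64Array, StringArray};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   
   /// Build a two-column batch already sorted by (key, t).
   fn batch(keys: Vec<&str>, times: Vec<i64>) -> RecordBatch {
       let schema = Arc::new(Schema::new(vec![
           Field::new("key", DataType::Utf8, false),
           Field::new("t", DataType::Int64, false),
       ]));
       let columns: Vec<ArrayRef> = vec![
           Arc::new(StringArray::from(keys)),
           Arc::new(Int64Array::from(times)),
       ];
       RecordBatch::try_new(schema, columns).unwrap()
   }
   
   fn main() {
       // `interleave`: identical rows in both inputs, so the merge alternates
       // between inputs one row at a time and there are no long runs to coalesce.
       let interleave_a = batch(vec!["a", "a", "b", "b"], vec![1, 2, 1, 2]);
       let interleave_b = batch(vec!["a", "a", "b", "b"], vec![1, 2, 1, 2]);
   
       // `merge_batches_no_overlap`: both inputs cover the same keys, but for each
       // key all of input A's rows sort before input B's, so the merge can copy a
       // whole run from A and then a whole run from B for every key.
       let no_overlap_a = batch(vec!["a", "a", "b", "b"], vec![1, 2, 1, 2]);
       let no_overlap_b = batch(vec!["a", "a", "b", "b"], vec![3, 4, 3, 4]);
   
       assert_eq!(interleave_a.num_rows(), interleave_b.num_rows());
       assert_eq!(no_overlap_a.num_rows(), no_overlap_b.num_rows());
   }
   ```
   
   In the no-overlap shape the coalescing logic can copy each per-key run with a single `extend`, which is why those benchmarks show the largest improvement.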
   
   Here are the results using `critcmp`:
   
   ```
   group                               master                                 pr
   -----                               ------                                 --
   interleave_batches                  1.04   637.5±51.84µs        ? ?/sec    1.00   615.5±12.13µs
   merge_batches_no_overlap_large      1.12    454.9±2.90µs        ? ?/sec    1.00   404.9±10.94µs
   merge_batches_no_overlap_small      1.14    485.1±6.67µs        ? ?/sec    1.00    425.7±9.33µs
   merge_batches_small_into_large      1.14    263.0±8.85µs        ? ?/sec    1.00    229.7±5.23µs
   merge_batches_some_overlap_large    1.05    532.5±8.33µs        ? ?/sec    1.00   508.3±14.24µs
   merge_batches_some_overlap_small    1.06   546.9±12.82µs        ? ?/sec    1.00   516.9±13.20µs
   ```
   
   The benchmarks show that in the case where this PR is likely to make the **least** difference, the PR's `SortPreservingMergeExec` is around `3%` faster.
   
   In the case where the PR is likely to make the **most** difference (when, for each set of rows that needs to be merged in order, `extend` can be called once on input A and then once on input B), the PR's `SortPreservingMergeExec` is around `12%` faster.
   
   **Note**: `critcmp` reports how much slower things are on `master`, hence the slightly different numbers in the benchmark results.
   
   

