e-dard opened a new pull request #722:
URL: https://github.com/apache/arrow-datafusion/pull/722


   **NOTE**: this requires Arrow 5.0 containing the following PR: 
https://github.com/apache/arrow-rs/pull/542
   
   # Which issue does this PR close?
   This will effectively close out 
https://github.com/apache/arrow-datafusion/issues/655
   
    # Rationale for this change
   
   When merging input record batches together into a single output stream the 
`SortPreservingMergeExec` operator currently builds a comparator for each 
column in the inputs every single time it compares one row to another. Due to 
the cost of building an Arrow `DynComparator` this ends up being prohibitively 
expensive.
   
   (I have more details in 
https://github.com/influxdata/influxdb_iox/issues/1983 but you can get the gist 
of the problem by looking at this profile:
   
   ![Screenshot 2021-07-12 at 10 50 
22](https://user-images.githubusercontent.com/501993/125497581-3d08f363-203e-4699-b75a-ba0f1faf60e4.png)
   
   The same comparator should be usable for comparing two input record batches 
until you have completely merged their rows. Therefore the rationale for the 
change in this PR is ensure that that's what happens.
   
   # What changes are included in this PR?
   
   The state associated with the process or merging two input record batches is 
managed by a `SortKeyCursor`. It tracks the current row, the columns, the 
backing record batch and so on. This PR adds a collection of Arrow 
`DynComparator`s that can be used (re-used) every time that a `SortKeyCursor` 
needs to compare one of its rows to that of another `SortKeyCursor`.
   
   In order to know whether a cursor was being compared to a cursor it had seen 
before or not I needed to be able to uniquely identify an input record batch. I 
did this by incrementing an index each time a new record batch was fed into the 
operator.
   
   When row comparison is happening the comparator collection is consulted and 
the `DynComparator` used if it exists. Otherwise one is created and stored.
   
   ### Performance
   
   This PR significantly improves the performance of the 
`SortPreservingMergeExec` operator because it amortises the cost of creating a 
comparator over all row comparisons in the record batch, rather than it being a 
fixed cost for every single row comparison. Here are some existing benchmark 
results:
   
   ```
   $  critcmp master pr
   group                               master                          pr
   -----                               ------                          --
   interleave_batches                  1.83   623.8±12.41µs            1.00    
341.2±6.98µs        
   merge_batches_no_overlap_large      1.56    400.6±4.94µs            1.00    
256.3±6.57µs        
   merge_batches_no_overlap_small      1.63   425.1±24.88µs            1.00    
261.1±7.46µs        
   merge_batches_small_into_large      1.18    228.0±3.95µs            1.00    
193.6±2.86µs        
   merge_batches_some_overlap_large    1.68   505.4±10.27µs            1.00    
301.3±6.63µs        
   merge_batches_some_overlap_small    1.64    515.7±5.21µs            1.00   
314.6±12.66µs        
   ```
   
   The results sugges that the changes in this PR improve performance bu upto 
`1.8x`. However, since the performance delta is tied to the size of the input, 
the performance boost could be significantly larger for larger inputs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to