e-dard opened a new pull request #722: URL: https://github.com/apache/arrow-datafusion/pull/722
**NOTE**: this requires Arrow 5.0 containing the following PR: https://github.com/apache/arrow-rs/pull/542 # Which issue does this PR close? This will effectively close out https://github.com/apache/arrow-datafusion/issues/655 # Rationale for this change When merging input record batches together into a single output stream the `SortPreservingMergeExec` operator currently builds a comparator for each column in the inputs every single time it compares one row to another. Due to the cost of building an Arrow `DynComparator` this ends up being prohibitively expensive. (I have more details in https://github.com/influxdata/influxdb_iox/issues/1983 but you can get the gist of the problem by looking at this profile:  The same comparator should be usable for comparing two input record batches until you have completely merged their rows. Therefore the rationale for the change in this PR is ensure that that's what happens. # What changes are included in this PR? The state associated with the process or merging two input record batches is managed by a `SortKeyCursor`. It tracks the current row, the columns, the backing record batch and so on. This PR adds a collection of Arrow `DynComparator`s that can be used (re-used) every time that a `SortKeyCursor` needs to compare one of its rows to that of another `SortKeyCursor`. In order to know whether a cursor was being compared to a cursor it had seen before or not I needed to be able to uniquely identify an input record batch. I did this by incrementing an index each time a new record batch was fed into the operator. When row comparison is happening the comparator collection is consulted and the `DynComparator` used if it exists. Otherwise one is created and stored. ### Performance This PR significantly improves the performance of the `SortPreservingMergeExec` operator because it amortises the cost of creating a comparator over all row comparisons in the record batch, rather than it being a fixed cost for every single row comparison. Here are some existing benchmark results: ``` $ critcmp master pr group master pr ----- ------ -- interleave_batches 1.83 623.8±12.41µs 1.00 341.2±6.98µs merge_batches_no_overlap_large 1.56 400.6±4.94µs 1.00 256.3±6.57µs merge_batches_no_overlap_small 1.63 425.1±24.88µs 1.00 261.1±7.46µs merge_batches_small_into_large 1.18 228.0±3.95µs 1.00 193.6±2.86µs merge_batches_some_overlap_large 1.68 505.4±10.27µs 1.00 301.3±6.63µs merge_batches_some_overlap_small 1.64 515.7±5.21µs 1.00 314.6±12.66µs ``` The results sugges that the changes in this PR improve performance bu upto `1.8x`. However, since the performance delta is tied to the size of the input, the performance boost could be significantly larger for larger inputs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
