andygrove opened a new pull request #9043: URL: https://github.com/apache/arrow/pull/9043
This PR introduces a new `CoalesceBatchesExec` physical operator, which combines small input batches into larger output batches. The physical optimizer inserts this operator around filters, because a highly selective filter can produce many small batches. This hurts performance in some cases (especially joins): we lose some of the benefits of vectorization when batches contain, for example, only a single row.

For TPC-H q12 at SF=100 with 8 partitions, this provides the following speedups:

| Batch Size | Master | This PR |
| --- | --- | --- |
| 16384 | 183.1 s | 41.1 s |
| 32768 | 59.4 s | 27.0 s |
| 65536 | 27.5 s | 19.3 s |
| 131072 | 18.4 s | 15.7 s |

Note that the new `CoalesceBatchesExec` uses `MutableArrayData`, which still suffers from some kind of exponential slowdown as the number of batches increases, so we should be able to optimize this further. At least we are now using `MutableArrayData` to combine smaller numbers of batches. Even if we fix the slowdown in `MutableArrayData`, we would still want `CoalesceBatchesExec` to help avoid tiny batches.
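To illustrate the coalescing idea, here is a minimal sketch in plain Rust. It is not the code from this PR and does not use the Arrow API: `Batch` is a hypothetical stand-in (a single `Vec<i64>` column) for a record batch, and `Coalescer`, `target_rows`, `push`, `finish`, and `coalesce` are names invented for this example. The assumed policy is to buffer incoming batches until at least `target_rows` rows have accumulated, then emit them as one concatenated batch; the operator's actual threshold and pass-through rules may differ.

```rust
/// Hypothetical stand-in for a record batch: one column of values.
type Batch = Vec<i64>;

/// Buffers small batches and emits a combined batch once at least
/// `target_rows` rows have accumulated (an assumed policy).
struct Coalescer {
    target_rows: usize,
    buffer: Vec<Batch>,
    buffered_rows: usize,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        Coalescer { target_rows, buffer: Vec::new(), buffered_rows: 0 }
    }

    /// Accepts one input batch. Returns a batch when one is ready to emit.
    fn push(&mut self, batch: Batch) -> Option<Batch> {
        // Empty batches carry no rows, so they are dropped.
        if batch.is_empty() {
            return None;
        }
        // A batch that is already large enough passes through untouched
        // when nothing is buffered, which avoids a needless copy.
        if self.buffer.is_empty() && batch.len() >= self.target_rows {
            return Some(batch);
        }
        self.buffered_rows += batch.len();
        self.buffer.push(batch);
        if self.buffered_rows >= self.target_rows {
            Some(self.concat())
        } else {
            None
        }
    }

    /// Flushes any remaining buffered rows at end of input.
    fn finish(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(self.concat())
        }
    }

    /// Concatenates all buffered batches into one and resets the buffer.
    fn concat(&mut self) -> Batch {
        let mut out = Vec::with_capacity(self.buffered_rows);
        for b in self.buffer.drain(..) {
            out.extend(b);
        }
        self.buffered_rows = 0;
        out
    }
}

/// Convenience wrapper: coalesces a whole sequence of batches.
fn coalesce(input: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = Coalescer::new(target_rows);
    let mut output = Vec::new();
    for batch in input {
        if let Some(ready) = coalescer.push(batch) {
            output.push(ready);
        }
    }
    if let Some(rest) = coalescer.finish() {
        output.push(rest);
    }
    output
}
```

For example, calling `coalesce` with `target_rows = 3` on batches of 1, 2, 0, 1, and 1 rows yields two output batches of 3 and 2 rows. Each input row is copied at most once, so the cost of concatenation stays bounded by the number of rows rather than the number of tiny batches reaching a downstream operator.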
