andygrove opened a new pull request #9043:
URL: https://github.com/apache/arrow/pull/9043


   This PR introduces a new `CoalesceBatchesExec` physical operator which 
combines small input batches and produces larger output batches. The physical 
optimizer inserts this operator around filters because highly selective filters 
can produce lots of small batches and this causes poor performance in some 
cases (especially joins) because we lose some of the benefits of vectorization 
if we have batches with single rows for example.
   
   For TPC-H q12 at SF=100 and 8 partitions, this provides the following 
speedups:
   
   | Batch Size | Master | This PR |
   | --- | --- | --- |
   | 16384 | 183.1 s | 41.1 s |
   | 32768 | 59.4 s | 27.0 s |
   | 65536 | 27.5 s | 19.3 s | 
   | 131072 | 18.4 s | 15.7 s | 
   
   Note that the new `CoalesceBatchesExec` uses `MutableArrayData` which still 
suffers from some kind of exponential slowdown as the number of batches 
increases, so we should be able to optimize this further, but at least we're 
using `MutableArrayData` to combine smaller numbers of batches now.
   
   Even if we fix the slowdown in `MutableArrayData`, we would still want 
`CoalesceBatchesExec` to help avoid tiny batches.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to