Dandandan opened a new pull request #9834:
URL: https://github.com/apache/arrow/pull/9834


   I did some comparisons with different batch sized with TCP-H on SF=1 in 
memory / 16 partitions. We chose a higher batch_size earlier as DF had some 
problems with smaller batch sizes (in hash join, but also because it missed the 
CoalesceBatches node).
   Smaller batch sizes have some performance benefit: more (possible) 
parallelism for smaller tables, batches are available sooner (in 
`CoalesceBatches`), etc. Also, memory usage can be reduced.
   
   But now it seems a value around 8000 is the sweet spot, only query 1 is 
faster with a slightly smaller batch size.
   
   
![image](https://user-images.githubusercontent.com/163737/112861775-669d5780-90b5-11eb-82e8-26c846d87075.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to