[GitHub] [arrow] andygrove opened a new pull request #9036: ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches

GitBox Mon, 28 Dec 2020 14:18:14 -0800


andygrove opened a new pull request #9036:
URL: https://github.com/apache/arrow/pull/9036



   I am investigating why join performance is so poor with smaller batch sizes 
(see https://issues.apache.org/jira/browse/ARROW-11030), and this is one 
optimization I have found so far that helps a bit.
   
   Prior to this PR, we used the size of the left or right batches to guess the 
capacity of output batches, which results in a lot of over-allocation in some 
cases. For TPC-H q12 at SF=100, I see vectors created with a capacity of 
~3,000,000 (the size of the build side of the join) that we then populate 
with only ~700 entries.
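
To put the over-allocation in concrete terms, here is a rough back-of-the-envelope sketch. The 8-byte element size and the `wasted_bytes` helper are assumptions made for illustration only. The real buffers in the join may hold other types and are not allocated through this function.

```rust
/// Bytes reserved but never written when a buffer is created with room for
/// `capacity` elements and ends up holding only `len` of them.
/// Hypothetical helper for illustration; not part of Arrow or DataFusion.
fn wasted_bytes(capacity: usize, len: usize, elem_size: usize) -> usize {
    // saturating_sub guards against len > capacity (a buffer that had to grow).
    capacity.saturating_sub(len) * elem_size
}

/// Fraction of the reserved slots that were actually used.
fn utilization(capacity: usize, len: usize) -> f64 {
    if capacity == 0 {
        0.0
    } else {
        len as f64 / capacity as f64
    }
}
```

Under that assumption, `wasted_bytes(3_000_000, 700, 8)` comes to roughly 24 MB reserved and left unused for a single output vector, and `utilization(3_000_000, 700)` is well under 0.1%.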
   
   This PR attempts to learn a good capacity based on previously processed 
batches.
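
The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `CapacityEstimator` type, its method names, and the "largest output seen so far" policy are all assumptions chosen to keep the example short.

```rust
/// Remembers how many rows earlier output batches contained and suggests a
/// capacity for the next output buffer.
/// Hypothetical type for illustration; not a DataFusion API.
#[derive(Debug, Default)]
struct CapacityEstimator {
    batches_seen: usize,
    max_rows: usize,
}

impl CapacityEstimator {
    /// Record the number of rows the most recent output batch actually held.
    fn observe(&mut self, output_rows: usize) {
        self.batches_seen += 1;
        self.max_rows = self.max_rows.max(output_rows);
    }

    /// Suggest a capacity for the next output buffer.
    /// Before any batch has been observed, fall back to the caller's guess
    /// (for example, the input batch size). Afterwards, use the largest
    /// output seen so far (an assumed policy; an average plus headroom
    /// would be another option).
    fn suggest(&self, fallback: usize) -> usize {
        if self.batches_seen == 0 {
            fallback
        } else {
            self.max_rows
        }
    }
}
```

A caller would build each output buffer with something like `Vec::with_capacity(estimator.suggest(build_side_len))` and call `observe` with the final row count once the batch is complete. If the estimate turns out too small, the buffer still grows on demand, so an underestimate costs a reallocation rather than correctness.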
   
   Here are query times in seconds at different batch sizes:
   
   Batch Size | Master | This PR
   -- | -- | --
   16384 | 189.6 | 158.0
   32768 | 61.9 | 47.2
   65536 | 28.2 | 21.4
   131072 | 19.0 | 15.6
   
   
   
   
   


