andygrove opened a new pull request #9036: URL: https://github.com/apache/arrow/pull/9036
I am investigating why join performance is so bad with smaller batch sizes (see https://issues.apache.org/jira/browse/ARROW-11030) and this is one optimization that I have found so far that helps a bit. Prior to this PR, we use the size of left or right batches to guess the capacity of output batches and this results in a lot of over allocation in some cases. For TPC-H q12 at SF=100, I see vectors created with capacity of ~3,000,000 (the size of the build-side of the join) and then we only populate it with ~700 entries. This PR attempts to learn a good capacity based on previously processed batches. Here are query times in seconds at different batch sizes: Batch Size | Master | This PR -- | -- | -- 16384 | 189.6 | 158.0 32768 | 61.9 | 47.2 65536 | 28.2 | 21.4 131072 | 19.0 | 15.6 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
