GitHub user ilooner opened a pull request:
https://github.com/apache/drill/pull/1101
DRILL-6032: Made the batch sizing for HashAgg more accurate.
## RecordBatchSizer changes
- The RecordBatchSizer previously computed fixed-width column sizes by
measuring the total size of a vector and dividing by the number of elements.
Because of this, the RecordBatchSizer would return a size of zero for FixedWidth
vectors that had no data. So I added a method to FixedWidth vectors to get the
size of a record, and the RecordBatchSizer now uses that method to compute the
column width.
- In some cases it was possible for the RecordBatchSizer to return a column
width of 0, even though vectors cannot have a width of 0 in practice. So I made
the minimum column width returned by the RecordBatchSizer 1.
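
A minimal sketch of the two width calculations described above. The class and method names here are hypothetical, chosen for illustration only, and are not the actual Drill API. It contrasts the data-derived width (which is zero for an empty vector) with a declared per-value width clamped to a minimum of 1.

```java
// Illustrative sketch only: names are hypothetical, not Drill's real classes.
final class ColumnWidthSketch {

    // Data-derived width: total bytes divided by the number of values.
    // An empty vector has no data, so this reports a width of 0.
    static int widthFromData(int totalBytes, int valueCount) {
        if (valueCount == 0) {
            return 0;
        }
        return (int) Math.ceil((double) totalBytes / valueCount);
    }

    // Declared width: a fixed-width vector knows its bytes per value
    // without looking at the data, so an empty vector still has a width.
    static int widthFromType(int bytesPerValue) {
        return clampWidth(bytesPerValue);
    }

    // Never report a column width below 1.
    static int clampWidth(int width) {
        return Math.max(1, width);
    }

    public static void main(String[] args) {
        // Assumed example: an empty vector of 4-byte values.
        System.out.println("from data: " + widthFromData(0, 0));
        System.out.println("from type: " + widthFromType(4));
    }
}
```

The clamp is kept as a separate helper so that it can also be applied to widths that are still derived from data.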
## HashAgg changes
- Removed commented-out code and unused variables.
- Removed the if statements that guarded debug output and used logger.debug
instead.
- Removed the extraNonNullColumns and extraRowBytes tweak parameters for
computing the sizes of batches.
- The RecordBatchSizer is now used to compute the width of each column instead
of ad hoc custom logic.
- The real width of each column is now used to estimate column sizes, instead of
taking the max width of all columns and assuming each column has that width.
- Removed the assumption that varchars will not exceed 50 characters in
length.
- Removed unnecessary condition checks in delayedSetup and
updateEstMaxBatchSize.
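
To illustrate why per-column widths give a tighter estimate than the max-width assumption, here is a rough sketch. The class, the method names, and the sample widths are all made up for illustration; this is not the HashAgg code itself.

```java
// Illustrative sketch only: names and numbers are hypothetical.
final class BatchSizeEstimateSketch {

    // Max-width estimate: assume every column is as wide as the widest one.
    static long estimateWithMaxWidth(int[] columnWidths, int rowCount) {
        int maxWidth = 0;
        for (int width : columnWidths) {
            maxWidth = Math.max(maxWidth, width);
        }
        return (long) maxWidth * columnWidths.length * rowCount;
    }

    // Per-column estimate: sum the width of each column to get the row width.
    static long estimateWithRealWidths(int[] columnWidths, int rowCount) {
        long rowWidth = 0;
        for (int width : columnWidths) {
            rowWidth += width;
        }
        return rowWidth * rowCount;
    }

    public static void main(String[] args) {
        // Assumed example: an int column, a bigint column, and a wide varchar column.
        int[] widths = {4, 8, 50};
        System.out.println("max-width estimate:  " + estimateWithMaxWidth(widths, 1000));
        System.out.println("per-column estimate: " + estimateWithRealWidths(widths, 1000));
    }
}
```

With one wide column and several narrow ones, the max-width estimate overstates the batch size by a large factor, while the per-column sum tracks the actual row width.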
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ilooner/drill DRILL-6032
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1101.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1101
----
commit b8682f6e09889e5ba334f36006fb9ed754f571f6
Author: Timothy Farkas <timothyfarkas@...>
Date: 2017-12-13T23:44:28Z
DRILL-6032: Made the batch sizing for HashAgg more accurate.
----