[
https://issues.apache.org/jira/browse/DRILL-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131486#comment-16131486
]
Boaz Ben-Zvi commented on DRILL-5588:
-------------------------------------
See DRILL-5728: the generated code allocates a special bigint value vector to
hold the "nullable" bits for the "real" values value vector. This
implementation would need to change (to use a nullable value vector for the
values) so that we could return the values value vector as is.
> Hash Aggregate: Avoid copy on output of aggregate columns
> ---------------------------------------------------------
>
> Key: DRILL-5588
> URL: https://issues.apache.org/jira/browse/DRILL-5588
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Affects Versions: 1.10.0
> Reporter: Boaz Ben-Zvi
>
> When the Hash Aggregate operator outputs its result batches downstream, the
> key columns (value vectors) are returned as is, but for the aggregate columns
> new value vectors are allocated and the values are copied (see the method
> allocateOutgoing()). This hurts performance. A second effect is on memory
> management, as this allocation is not planned for by the code that controls
> spilling, etc.
> For some simple aggregate functions (e.g. SUM), the stored value vectors
> for the aggregate values can be returned as is. For functions like AVG, there
> is a need to divide the SUM values by the COUNT values. Still, this can be
> done in place (overwriting the SUM values), avoiding a new allocation and
> copy.
> For VarChar type aggregate values (only used by MAX or MIN), there is
> another issue -- currently any such value vector is allocated as an
> ObjectVector (see BatchHolder()), on the JVM heap rather than in direct
> memory. This is to manage the sizes of the values, which could change as the
> aggregation progresses (e.g., for MAX(name) -- the first record has 'abe', but
> the next record has 'benjamin', which is both greater ('b' > 'a') and longer).
> For the final output, this requires a new allocation and a copy in order to
> have a compact value vector in direct memory. Maybe the ObjectVector could be
> replaced with some direct memory implementation that is optimized for "good"
> values (e.g., all are of similar size), but penalizes "bad" values (e.g.,
> reallocates or moves values, when needed)?
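
Two of the ideas in the description above can be illustrated with a small,
self-contained Java sketch. It is only a model under stated assumptions, not
Drill code: plain direct ByteBuffers stand in for value vectors, and the names
InPlaceAvg, finalizeAvg, SlottedMaxStore, offer, get and grow are hypothetical.
The first class overwrites each SUM slot with SUM / COUNT so the same buffer
can be handed downstream. The second keeps MAX(varchar) values in fixed-width
slots in direct memory and pays a reallocate-and-move penalty only when a
value outgrows its slot.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper: finalizes AVG by dividing SUM by COUNT in place. */
final class InPlaceAvg {
    // sums holds one 8-byte double per group, counts one 8-byte long per group.
    // Each SUM slot is overwritten with SUM / COUNT and the same buffer is returned.
    static ByteBuffer finalizeAvg(ByteBuffer sums, ByteBuffer counts, int groups) {
        for (int g = 0; g < groups; g++) {
            long count = counts.getLong(g * Long.BYTES);
            if (count == 0) continue; // assumption: a real operator would mark this slot null
            int off = g * Double.BYTES;
            sums.putDouble(off, sums.getDouble(off) / count);
        }
        return sums;
    }
}

/** Hypothetical MAX(varchar) store: fixed-width slots in one direct buffer. */
final class SlottedMaxStore {
    private final int groups;
    private final int[] lengths; // -1 means "no value seen yet"
    private int slotWidth;
    private ByteBuffer data;
    int reallocations = 0;       // counts how often the "bad value" penalty was paid

    SlottedMaxStore(int groups, int initialSlotWidth) {
        this.groups = groups;
        this.slotWidth = initialSlotWidth;
        this.lengths = new int[groups];
        Arrays.fill(lengths, -1);
        this.data = ByteBuffer.allocateDirect(groups * initialSlotWidth);
    }

    // Keeps the candidate only if it is greater than the stored value (unsigned byte order).
    void offer(int group, byte[] candidate) {
        if (lengths[group] >= 0 && compare(candidate, get(group)) <= 0) return;
        if (candidate.length > slotWidth) grow(candidate.length);
        for (int i = 0; i < candidate.length; i++) {
            data.put(group * slotWidth + i, candidate[i]);
        }
        lengths[group] = candidate.length;
    }

    // Copies the stored value to the heap; returns null if the group has no value yet.
    byte[] get(int group) {
        if (lengths[group] < 0) return null;
        byte[] out = new byte[lengths[group]];
        for (int i = 0; i < out.length; i++) out[i] = data.get(group * slotWidth + i);
        return out;
    }

    // The penalty path: widen every slot and move all stored values to a new buffer.
    private void grow(int needed) {
        int newWidth = Math.max(slotWidth, 1);
        while (newWidth < needed) newWidth *= 2;
        ByteBuffer wider = ByteBuffer.allocateDirect(groups * newWidth);
        for (int g = 0; g < groups; g++) {
            for (int i = 0; i < lengths[g]; i++) {
                wider.put(g * newWidth + i, data.get(g * slotWidth + i));
            }
        }
        data = wider;
        slotWidth = newWidth;
        reallocations++;
    }

    private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

With an initial slot width of 4, offering 'abe' and then 'benjamin' to the
same group triggers exactly one grow() in this model; values of similar size
never leave the fast path. The doubling policy and the heap copy in get() are
simplifications chosen for brevity.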