[jira] [Commented] (DRILL-5588) Hash Aggregate: Avoid copy on output of aggregate columns

Boaz Ben-Zvi (JIRA) Thu, 17 Aug 2017 16:41:48 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131486#comment-16131486
 ]


Boaz Ben-Zvi commented on DRILL-5588:
-------------------------------------

See DRILL-5728: the generated code allocates a separate bigint value vector to 
hold the "nullable" bits for the "real" value vector of aggregate values. This 
implementation would need to change (to a single nullable value vector for the 
values) before the values vector could be returned as is.
  

> Hash Aggregate: Avoid copy on output of aggregate columns
> ---------------------------------------------------------
>
>                 Key: DRILL-5588
>                 URL: https://issues.apache.org/jira/browse/DRILL-5588
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0
>            Reporter: Boaz Ben-Zvi
>
>  When the Hash Aggregate operator outputs its result batches downstream, the 
> key columns (value vectors) are returned as is, but for the aggregate columns 
> new value vectors are allocated and the values are copied (see the method 
> allocateOutgoing()). This copy hurts performance. It also affects memory 
> management, as this allocation is not planned for by the code that controls 
> spilling, etc.
>    For some simple aggregate functions (e.g. SUM), the stored value vectors 
> for the aggregate values can be returned as is. For functions like AVG, the 
> SUM values must be divided by the COUNT values. Still, this can be done in 
> place (overwriting the SUM values), avoiding a new allocation and copy.
>    For VarChar type aggregate values (only used by MAX or MIN), there is 
> another issue: currently any such value vector is allocated as an 
> ObjectVector (see BatchHolder()), on the JVM heap rather than in direct 
> memory. This is to manage the sizes of the values, which could change as the 
> aggregation progresses (e.g., for MAX(name), the first record has 'abe', but 
> the next record has 'benjamin', which is both bigger ('b' > 'a') and longer). 
> For the final output, this requires a new allocation and a copy in order to 
> have a compact value vector in direct memory. Maybe the ObjectVector could be 
> replaced with some direct memory implementation that is optimized for "good" 
> values (e.g., all are of similar size), but penalizes "bad" values (e.g., 
> reallocates or moves values when needed)?
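
To illustrate the in-place AVG idea from the description, here is a minimal 
sketch using plain Java arrays as stand-ins for the SUM and COUNT value 
vectors. It is not Drill code: the class InPlaceAvgSketch and the method 
averageInPlace() are hypothetical names, and real value vectors live in 
direct memory rather than in heap arrays. It only shows that the SUM buffer 
can be overwritten with the averages and handed downstream without 
allocating a second buffer.

```java
import java.util.Arrays;

public class InPlaceAvgSketch {

    // Hypothetical helper (not a Drill API): overwrite each SUM slot with
    // SUM / COUNT, so the same buffer can be returned as the AVG column.
    public static double[] averageInPlace(double[] sums, long[] counts) {
        if (sums.length != counts.length) {
            throw new IllegalArgumentException("sums and counts must have equal length");
        }
        for (int i = 0; i < sums.length; i++) {
            // Assumption: a group with no counted values is marked NaN here;
            // a real implementation would set a null bit instead.
            sums[i] = (counts[i] == 0) ? Double.NaN : sums[i] / counts[i];
        }
        return sums; // same array instance, no new allocation and no copy
    }

    public static void main(String[] args) {
        double[] sums = {10.0, 9.0, 7.5};
        long[] counts = {4, 3, 1};
        double[] out = averageInPlace(sums, counts);
        System.out.println(out == sums);          // prints "true"
        System.out.println(Arrays.toString(out)); // prints "[2.5, 3.0, 7.5]"
    }
}
```

The trade-off is that the SUM values are destroyed by the division, so this 
only works if nothing needs them after the batch is output.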



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
