Boaz Ben-Zvi created DRILL-5588:
-----------------------------------

             Summary: Hash Aggregate: Avoid copy on output of aggregate columns
                 Key: DRILL-5588
                 URL: https://issues.apache.org/jira/browse/DRILL-5588
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Relational Operators
    Affects Versions: 1.10.0
            Reporter: Boaz Ben-Zvi


 When the Hash Aggregate operator outputs its result batches downstream, the 
key columns (value vectors) are returned as is, but for the aggregate columns 
new value vectors are allocated and the values are copied (see the method 
allocateOutgoing()). This copy hurts performance. It also affects memory 
management, because the allocation is not planned for by the code that 
controls spilling, etc.
   For some simple aggregate functions (e.g. SUM), the stored value vectors for 
the aggregate values can be returned as is. For functions like AVG, the SUM 
values must be divided by the COUNT values. Still, this can be done in place 
(overwriting the SUM values), avoiding a new allocation and copy. 
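
   The in-place AVG idea can be sketched as follows. This is only an 
illustration: plain Java arrays stand in for Drill's value vectors, and the 
class and method names (InPlaceAvg, finalizeAvg) are hypothetical, not part of 
the Hash Aggregate code. Mapping groups with a zero count to NaN is likewise 
an assumption made for the sketch.

```java
import java.util.Arrays;

/** Hypothetical illustration; arrays stand in for Drill value vectors. */
final class InPlaceAvg {

    /**
     * Overwrites each sums[i] with sums[i] / counts[i], so no second
     * output array is allocated and nothing is copied.
     * Groups with a zero count become NaN (an assumption of this sketch).
     */
    static void finalizeAvg(double[] sums, long[] counts, int groupCount) {
        for (int i = 0; i < groupCount; i++) {
            sums[i] = counts[i] == 0 ? Double.NaN : sums[i] / counts[i];
        }
    }

    public static void main(String[] args) {
        double[] sums = {10.0, 9.0, 7.5};
        long[] counts = {4, 3, 1};
        finalizeAvg(sums, counts, sums.length);
        // The array that held the sums now holds the averages.
        System.out.println(Arrays.toString(sums)); // prints [2.5, 3.0, 7.5]
    }
}
```

   After the call, the array that accumulated the SUM values is itself the 
output column, so the only extra cost is one pass of divisions.
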
   For VarChar type aggregate values (only used by MAX or MIN), there is 
another issue: currently any such value vector is allocated as an 
ObjectVector (see BatchHolder()), on the JVM heap rather than in direct memory. 
This is to manage the sizes of the values, which can change as the 
aggregation progresses (e.g., for MAX(name), the first record has 'abe', but the 
next record has 'benjamin', which is both bigger ('b' > 'a') and longer). For 
the final output, this requires a new allocation and a copy in order to have a 
compact value vector in direct memory. Maybe the ObjectVector could be replaced 
with some direct memory implementation that is optimized for "good" values 
(e.g., all of similar size) but penalizes "bad" values (e.g., reallocates 
or moves values when needed)?
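
   One possible shape for such a direct memory store is sketched below. It is 
a rough sketch under several assumptions, not a proposal for the actual 
implementation: java.nio direct buffers stand in for Drill's own buffer 
allocator, every group gets a fixed-width slot, and the class name 
DirectMaxStore and all of its methods are hypothetical. Values that fit in a 
slot are overwritten in place (the "good" case); a value that is too long 
forces every slot to be widened and moved (the "bad" case).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not Drill's ObjectVector or VarCharVector API. */
final class DirectMaxStore {
    private ByteBuffer data;      // off-heap storage, one fixed-width slot per group
    private final int[] lengths;  // current byte length of each group's value
    private final int groupCount;
    private int slotWidth;

    DirectMaxStore(int groupCount, int initialSlotWidth) {
        this.groupCount = groupCount;
        this.slotWidth = initialSlotWidth;
        this.lengths = new int[groupCount];
        this.data = ByteBuffer.allocateDirect(groupCount * initialSlotWidth);
    }

    /** Fast path: overwrite in place. Slow path: widen every slot first. */
    void set(int group, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > slotWidth) {
            grow(Math.max(bytes.length, slotWidth * 2));
        }
        ByteBuffer view = data.duplicate();
        view.position(group * slotWidth);
        view.put(bytes);
        lengths[group] = bytes.length;
    }

    String get(int group) {
        byte[] bytes = new byte[lengths[group]];
        ByteBuffer view = data.duplicate();
        view.position(group * slotWidth);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    int slotWidth() {
        return slotWidth;
    }

    /** The penalty for a "bad" value: reallocate and move every stored value. */
    private void grow(int newWidth) {
        ByteBuffer wider = ByteBuffer.allocateDirect(groupCount * newWidth);
        byte[] tmp = new byte[slotWidth];
        for (int g = 0; g < groupCount; g++) {
            ByteBuffer src = data.duplicate();
            src.position(g * slotWidth);
            src.get(tmp, 0, lengths[g]);
            wider.position(g * newWidth);
            wider.put(tmp, 0, lengths[g]);
        }
        data = wider;
        slotWidth = newWidth;
    }
}
```

   With 4-byte slots, storing 'abe' takes the fast path, while replacing it 
with 'benjamin' triggers the widening step. Note that fixed-width slots trade 
wasted space for in-place updates, so this layout is not yet the compact 
value vector wanted for the final output.
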

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)