[
https://issues.apache.org/jira/browse/HIVE-20177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal V updated HIVE-20177:
---------------------------
Fix Version/s: 4.0.0
> Vectorization: Reduce KeyWrapper allocation in GroupBy Streaming mode
> ---------------------------------------------------------------------
>
> Key: HIVE-20177
> URL: https://issues.apache.org/jira/browse/HIVE-20177
> Project: Hive
> Issue Type: Bug
> Components: Vectorization
> Reporter: Gopal V
> Assignee: Gopal V
> Priority: Major
> Labels: performance
> Fix For: 4.0.0
>
> Attachments: HIVE-20177.01.patch, HIVE-20177.WIP.patch
>
>
> The streaming mode for VectorGroupBy allocates a large number of arrays due
> to VectorHashKeyWrapper::duplicateTo().
> Since the vectors can't be mutated in place while a single batch is being
> processed, these allocations can be cut by 1000x by allocating the streaming
> key once at the end of the loop, instead of reallocating within the loop.
> {code}
> for(int i = 0; i < batch.size; ++i) {
>   if (!batchKeys[i].equals(streamingKey)) {
>     // We've encountered a new key, must save current one
>     // We can't forward yet, the aggregators have not been evaluated
>     rowsToFlush[flushMark] = currentStreamingAggregators;
>     if (keysToFlush[flushMark] == null) {
>       keysToFlush[flushMark] = (VectorHashKeyWrapper) streamingKey.copyKey();
>     } else {
>       streamingKey.duplicateTo(keysToFlush[flushMark]);
>     }
>     currentStreamingAggregators = streamAggregationBufferRowPool.getFromPool();
>     batchKeys[i].duplicateTo(streamingKey);
>     ++flushMark;
>   }
> {code}
> The duplicateTo() call can be pushed out of the loop, since the only key that
> truly needs a copy is the last unique key in the VRB.
> The actual byte[] values within the keys are safely copied out by
> VectorHashKeyWrapperBatch.assignRowColumn(), which calls setVal() and not
> setRef().
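To illustrate why the setVal() versus setRef() distinction matters here, the sketch below uses a hypothetical single-slot holder (`BytesSlotSketch`, not a Hive class) whose two methods follow the usual meaning of those names: setRef() keeps a reference to the caller's array, while setVal() copies the bytes. Only the copied value survives a later overwrite of the source buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical single-slot holder; it only mirrors the setVal/setRef naming. */
final class BytesSlotSketch {
  private byte[] bytes;
  private int start;
  private int length;

  /** Stores a reference: later writes to 'source' are visible through this slot. */
  void setRef(byte[] source, int start, int length) {
    this.bytes = source;
    this.start = start;
    this.length = length;
  }

  /** Stores a private copy: later writes to 'source' do not affect this slot. */
  void setVal(byte[] source, int start, int length) {
    this.bytes = Arrays.copyOfRange(source, start, start + length);
    this.start = 0;
    this.length = length;
  }

  String asString() {
    return new String(bytes, start, length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] keyBytes = "abc".getBytes(StandardCharsets.UTF_8);

    BytesSlotSketch byRef = new BytesSlotSketch();
    BytesSlotSketch byVal = new BytesSlotSketch();
    byRef.setRef(keyBytes, 0, keyBytes.length);
    byVal.setVal(keyBytes, 0, keyBytes.length);

    keyBytes[0] = 'x'; // simulate the source buffer being reused

    System.out.println(byRef.asString()); // reflects the overwrite
    System.out.println(byVal.asString()); // keeps the original bytes
  }
}
```

Under that assumption, output written with setVal() stays intact even if the key objects it was read from are only held by reference and later reused.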