[ 
https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858516#comment-13858516
 ] 

Sun Rui commented on HIVE-6120:
-------------------------------

Fixed test failures.

> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-6120
>                 URL: https://issues.apache.org/jira/browse/HIVE-6120
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6120.1.patch, HIVE-6120.2.patch
>
>
> In most cases, partial distinct aggregation is not needed in map-side 
> groupby. The exception is that with sorted bucketized tables partial distinct 
> aggregation can be done by the mappers in some scenarios, as what is done by 
> GroupByOptimzer.
> Currently, partial distinct aggregation is done in the map-side GroupBy and 
> then shuffle of the partial result is done in the following ReduceSink 
> operator, in cases where they are not needed. This wastes CPU cycles, memory 
> and network bandwidth.
> This optimization eliminates un-needed partial distinct aggregations, which 
> improves performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
> {noformat}
> After optimization:
> {noformat}
>               Group By Operator
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to