[ https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858516#comment-13858516 ]
Sun Rui commented on HIVE-6120: ------------------------------- Fixed test failures. > Add GroupBy optimization to eliminate un-needed partial distinct aggregations > ----------------------------------------------------------------------------- > > Key: HIVE-6120 > URL: https://issues.apache.org/jira/browse/HIVE-6120 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Reporter: Sun Rui > Assignee: Sun Rui > Attachments: HIVE-6120.1.patch, HIVE-6120.2.patch > > > In most cases, partial distinct aggregation is not needed in map-side > groupby. The exception is that with sorted bucketized tables partial distinct > aggregation can be done by the mappers in some scenarios, as what is done by > GroupByOptimzer. > Currently, partial distinct aggregation is done in the map-side GroupBy and > then shuffle of the partial result is done in the following ReduceSink > operator, in cases where they are not needed. This wastes CPU cycles, memory > and network bandwidth. > This optimization eliminates un-needed partial distinct aggregations, which > improves performance and reduces memory usage. > For example, > EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key; > Before optimization: > {noformat} > Group By Operator > aggregations: > expr: count(DISTINCT value) > bucketGroup: false > keys: > expr: key > type: int > expr: value > type: string > mode: hash > outputColumnNames: _col0, _col1, _col2 > Reduce Output Operator > key expressions: > expr: _col0 > type: int > expr: _col1 > type: string > sort order: ++ > Map-reduce partition columns: > expr: _col0 > type: int > tag: -1 > value expressions: > expr: _col2 > type: bigint > {noformat} > After optimization: > {noformat} > Group By Operator > bucketGroup: false > keys: > expr: key > type: int > expr: value > type: string > mode: hash > outputColumnNames: _col0, _col1 > Reduce Output Operator > key expressions: > expr: _col0 > type: int > expr: _col1 > type: string > sort order: ++ > Map-reduce partition columns: > expr: _col0 > type: int > tag: -1 > {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)