[ https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858551#comment-13858551 ]
Hive QA commented on HIVE-6120:
-------------------------------

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620791/HIVE-6120.2.patch

{color:green}SUCCESS:{color} +1 4818 tests passed

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/766/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12620791

> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-6120
>                 URL: https://issues.apache.org/jira/browse/HIVE-6120
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6120.1.patch, HIVE-6120.2.patch
>
>
> In most cases, partial distinct aggregation is not needed in the map-side
> GroupBy. The exception is sorted, bucketized tables, where the mappers can
> perform partial distinct aggregation in some scenarios, as is done by
> GroupByOptimizer.
> Currently, partial distinct aggregation is performed in the map-side GroupBy
> and the partial result is then shuffled by the following ReduceSink operator,
> even in cases where it is not needed. This wastes CPU cycles, memory and
> network bandwidth.
> This optimization eliminates un-needed partial distinct aggregations, which
> improves performance and reduces memory usage.
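The effect described above can be illustrated with a small toy model. This is only a sketch and not Hive's implementation: the function names are hypothetical, and the partial aggregate is modeled as a per-(key, value) row count, which is an assumption made purely for illustration. The point it shows is that the reducer can compute count(DISTINCT value) from the (key, value) pairs alone, so the extra aggregate column carried through the shuffle does not change the result.

```python
from itertools import groupby

def map_side_with_partial(rows):
    """Model of the 'before' plan: hash GroupBy on (key, value) that also
    maintains a partial aggregate per entry (assumed here to be a row count)."""
    buf = {}
    for key, value in rows:
        buf[(key, value)] = buf.get((key, value), 0) + 1
    # Three columns are shuffled: key, value, partial aggregate.
    return [(k, v, partial) for (k, v), partial in buf.items()]

def map_side_keys_only(rows):
    """Model of the 'after' plan: same hash GroupBy keys, no aggregate buffer."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return list(dict.fromkeys(rows))

def reduce_side(shuffled):
    """Model of the reducer: de-duplicate and sort (key, value) pairs, then
    count the distinct values per key. Any third column is ignored."""
    pairs = sorted({(row[0], row[1]) for row in shuffled})
    return {k: sum(1 for _ in grp) for k, grp in groupby(pairs, key=lambda p: p[0])}

rows = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
before = reduce_side(map_side_with_partial(rows))
after = reduce_side(map_side_keys_only(rows))
print(before == after)
```

In this model both plans give the same per-key distinct counts, while the second one shuffles one column fewer and keeps no aggregation state on the map side.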
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
>             Group By Operator
>               aggregations:
>                     expr: count(DISTINCT value)
>               bucketGroup: false
>               keys:
>                     expr: key
>                     type: int
>                     expr: value
>                     type: string
>               mode: hash
>               outputColumnNames: _col0, _col1, _col2
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                       expr: _col1
>                       type: string
>                 sort order: ++
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col2
>                       type: bigint
> {noformat}
> After optimization:
> {noformat}
>             Group By Operator
>               bucketGroup: false
>               keys:
>                     expr: key
>                     type: int
>                     expr: value
>                     type: string
>               mode: hash
>               outputColumnNames: _col0, _col1
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                       expr: _col1
>                       type: string
>                 sort order: ++
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)