[jira] Commented: (HIVE-609) optimize multi-group by

Zheng Shao (JIRA) Fri, 17 Jul 2009 15:42:43 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732765#action_12732765
 ]


Zheng Shao commented on HIVE-609:
---------------------------------

@hive.609.2.patch: Reviewed with Namit offline. Here are the comments:

ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:89
    Used by hash distinct aggregation when hashGrpKeyNotRedKey is true.

ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:98
    Shall we rename it to "reduceGroupKeyIsDistinctExpr"? I think this is more
    accurate than "groupbyKeyIsNotReduceKey".

ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:401
    I just found that "hashAggr" and "hashDistinctAggr" are always used
    together, so we only need to pass one parameter to this function. What
    about some javadoc for this function (mainly for these parameters, because
    I think they are not easy to understand)?

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3377
    Change the name to getCommonDistinctExpr.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3397
    The assert is not valid.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3404
    Instead of computing colExprMap for every reduceSinkDesc(), it can be
    computed offline, since all the info is available.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3462
    This check is not needed.

ql/src/java/org/apache/hadoop/hive/ql/plan/exprNodeDesc.java:40
    Add more comments.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:1590
    Add more comments. Use a boolean var to track distPartAgg || (... DIST).
    In one case, the partial results have already been computed.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3314
    Merge with optimizeGroupby.

> optimize multi-group by 
> ------------------------
>
>                 Key: HIVE-609
>                 URL: https://issues.apache.org/jira/browse/HIVE-609
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.609.1.patch, hive.609.2.patch
>
>
> For a query like:
> from src
> insert overwrite table dest1 select col1, count(distinct colx) group by col1
> insert overwrite table dest2 select col2, count(distinct colx) group by col2;
> If map side aggregation is turned off, we currently do 4 map-reduce jobs.
> The plan can be optimized by running it in 3 map-reduce jobs, by spraying over
> the distinct column first and then aggregating individual results.
> This may not be possible if there are multiple distinct columns, but the above
> query is very common in data warehousing environments.
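
To make the idea in the description concrete, here is a toy, in-memory sketch of "spray over the distinct column first, then aggregate". It is not Hive's implementation or plan: the function name, the partition count, and the use of Python's hash() as a partitioner are assumptions for illustration only, and NULL handling is ignored. It relies on one property: every row with the same colx lands in the same partition, so the distinct (group key, colx) pairs seen by different partitions never overlap and the partial counts can simply be summed.

```python
from collections import Counter, defaultdict


def multi_group_by_distinct(rows, num_partitions=4):
    """Toy model of the 3-stage plan for two count(distinct colx) group-bys.

    rows: iterable of (col1, col2, colx) tuples.
    Returns (dest1, dest2): dicts mapping col1 / col2 to count(distinct colx).
    All names here are hypothetical; this only mimics the data flow.
    """
    # Stage 1 (shared): spray rows over the distinct column colx.
    partitions = defaultdict(list)
    for col1, col2, colx in rows:
        partitions[hash(colx) % num_partitions].append((col1, col2, colx))

    # Still stage 1: each partition dedupes (key, colx) pairs and emits
    # partial distinct counts for both group-by keys.
    partials1, partials2 = [], []
    for part in partitions.values():
        pairs1 = {(c1, cx) for c1, _, cx in part}
        pairs2 = {(c2, cx) for _, c2, cx in part}
        partials1.append(Counter(c1 for c1, _ in pairs1))
        partials2.append(Counter(c2 for c2, _ in pairs2))

    # Stages 2 and 3: one final aggregation per destination table,
    # summing the partial counts for each group key.
    dest1, dest2 = Counter(), Counter()
    for p in partials1:
        dest1.update(p)
    for p in partials2:
        dest2.update(p)
    return dict(dest1), dict(dest2)
```

In this sketch the shared first stage stands in for the one job that sprays on colx, and the two final summations stand in for the per-destination jobs, which is how the count of 3 arises here. With more than one distinct column there is no single column to spray on, which is consistent with the caveat in the description.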

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
