[
https://issues.apache.org/jira/browse/HIVE-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732765#action_12732765
]
Zheng Shao commented on HIVE-609:
---------------------------------
@hive.609.2.patch: Reviewed with Namit offline. Here are the comments:
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:89 Used by
hash distinct aggregation when hashGrpKeyNotRedKey is true
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:98 Shall
we rename it to "reduceGroupKeyIsDistinctExpr"? I think this is more accurate
than "groupbyKeyIsNotReduceKey".
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:401 I just
found that "hashAggr" and "hashDistinctAggr" are always used together, so we
only need to pass one parameter to this function. What about some javadoc for
this function (mainly for these parameters, because I think they are not easy
to understand)?
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3377 Change
the name to getCommonDistinctExpr.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3397 The
assert is not valid.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3404 Instead
of computing colExprMap for every reduceSinkDesc(), it can be computed offline,
since all the info is available.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3462 This
check is not needed.
ql/src/java/org/apache/hadoop/hive/ql/plan/exprNodeDesc.java:40 Needs more
comments.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:1590 Needs
more comments. Add a boolean var to track distPartAgg || (... DIST). In one
case, the partial results have already been computed.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3314 Merge
with optimizeGroupby.
> optimize multi-group by
> ------------------------
>
> Key: HIVE-609
> URL: https://issues.apache.org/jira/browse/HIVE-609
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Namit Jain
> Assignee: Namit Jain
> Attachments: hive.609.1.patch, hive.609.2.patch
>
>
> For a query like:
> from src
> insert overwrite table dest1 select col1, count(distinct colx) group by col1
> insert overwrite table dest2 select col2, count(distinct colx) group by col2;
> If map-side aggregation is turned off, we currently run 4 map-reduce jobs.
> The plan can be optimized to run in 3 map-reduce jobs by spraying over the
> distinct column first and then aggregating the individual results.
> This may not be possible if there are multiple distinct columns, but the
> above query is very common in data warehousing environments.
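
To illustrate the idea in the description above, here is a minimal Python sketch of the 3-job plan. It is not Hive code and is not taken from the patch. The function names, the in-memory "partitions", and the dict-based rows are all assumptions made for illustration. The key observation is that once rows are partitioned on the distinct column, a given colx value lives in exactly one partition, so per-partition distinct counts can simply be summed.

```python
from collections import defaultdict


def spray_and_partial_count(rows, group_cols, distinct_col, num_reducers):
    """Job 1 (hypothetical sketch): partition rows on the distinct column,
    then compute a partial count(distinct) per group key in each partition."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # "Spray" on the distinct column: equal values land in one partition.
        partitions[hash(row[distinct_col]) % num_reducers].append(row)

    partials = {g: [] for g in group_cols}
    for part in partitions:
        for g in group_cols:
            # Deduplicate (group key, distinct value) pairs inside the partition.
            seen = {(row[g], row[distinct_col]) for row in part}
            counts = defaultdict(int)
            for key, _ in seen:
                counts[key] += 1
            partials[g].extend(counts.items())
    return partials


def merge_partials(partial_counts):
    """Jobs 2 and 3 (hypothetical sketch): sum the partial counts per group
    key. Summing is valid because no distinct value spans two partitions."""
    totals = defaultdict(int)
    for key, n in partial_counts:
        totals[key] += n
    return dict(totals)
```

Calling spray_and_partial_count(rows, ["col1", "col2"], "colx", n) once and then merge_partials on each of the two partial lists corresponds to one shared job followed by one job per destination table, which gives the 3 jobs mentioned in the description.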