Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Ashutosh Chauhan Wed, 09 Jan 2013 18:24:12 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------




ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
<https://reviews.apache.org/r/7126/#comment32714>

    I don't see any modifications to ql/if/queryplan.thrift Please mod that 
file appropriately and add the generated code  in the patch



ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32715>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32716>

    Please add javadocs, explaining this class as well as the fact that there 
are currently two classes which extends this. Also, add difference in behavior 
of those two classes which necessitates the need for this base class.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32811>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32812>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32813>

    please do return "CorrelationComposite" here instead of "CCO"



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32722>

    It seems like you are serializing and then immediately deserializing keys 
and values here, which I think is required for ReduceSinkOperator since keys 
and values are transferred from mapper process to reducer process. This is 
redundant in CLSReduceSinkOp since its all running inline in one operator 
pipeline in same memory. So, its looks like this could be avoided. 
    I guess doing this keeps implementation easier, but if this is true, we 
should take this up in follow-up jira as performance improvement.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32810>

    Does this imply CLSRSOperator cannot have more than one child operator in 
any case. If so, can you please add comments stating that along with small 
description why is that?



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32718>

    Please add comments here about why its overridden for empty implementation 
and how startGroup() is taken care of in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32719>

    Please add comments here about why its overridden for empty implementation 
and how endGroup() is dealt with in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32720>

    Please add comments here about why its overridden for empty implementation 
and how this is taken care of in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/7126/#comment32814>

    I don't get the reason of introducing rowNumber in Operator. It doesn't 
look its required for optimization to take place correctly. Can you elaborate 
the need of this variable and different associated methods which are introduced 
along with it?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32815>

    Can you add comments clarifying whats the reason for this? 



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32816>

    Seems like the logic of auto conversion of join to map-join is same as the 
one used in CommonJoinResolver. If thats the case, instead of replicating the 
logic here, it will be good to refactor that logic out of CommonJoinResolver in 
some util function in some util class and then use that function here. That 
logic will likely change over course of time and risk getting diverged if we 
duplicate instead of reuse. 



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32817>

    If I am reading this correctly than, map-side groupbys are converted to 
reduce-side groupbys. I don't see any fundamental reason why correlation 
optimizer cannot work with map-side groupbys. Is this just the limitation of 
current implementation or is there any fundamental reason for it. Please add 
comments whatever the case is.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32828>

    Instead of changing the plan first and than undo the changes on the plan 
later on (in case optimizer can not  be applied) I am wondering a safer choice 
could have been to clone the plan and than do your changes on the cloned plan. 
If we find that plan can be optimized, only than change the original plan. This 
will be a bit more safer. Thoughts?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
<https://reviews.apache.org/r/7126/#comment32837>

    Can you add comments why it is not compatible with skew-join and 
groupby-skew optimizations?



ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
<https://reviews.apache.org/r/7126/#comment32838>

    Explain annotation should have worked, no ? Having this working will be 
very useful for debugging.


- Ashutosh Chauhan


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple 
> correlated MapReduce jobs into one jobs. Open a new request since I have been 
> working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   
> ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
>  7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>

Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Reply via email to