Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Ashutosh Chauhan Tue, 23 May 2017 10:01:12 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175801
-----------------------------------------------------------





ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 61 (patched)
<https://reviews.apache.org/r/59468/#comment249127>

    Comment: Queries of form : select max(c), count(distinct c) from T; 
generates a plan of form TS->mGBy->RS->rGBy->FS 
    This plan suffers from a problem that vertex containing rGBy->FS 
necessarily need to have 1 task. This limitation results in slow execution 
because that task gets all the data. 
    This optimization if successful will rewrite above plan to 
TS->mGby->RS->mGby2->RS->rGBy->FS This introduces extra vertex of mGby2->RS 
Note this vertex can have multiple tasks and since we are doing aggregation, 
output of this must necessarily be smaller than its input, which results in 
much less data going in to rGby->FS vertex, which continues to have single task.
    Also note on calcite tree we have HiveExpandDistinctAggregatesRule rule 
which does similiar plan transformation but has different conditions which 
needs to be satisified.
    Additionally, we don't do any costing here but this is possibly that this 
transformation may slow down query a bit since if data is small enough to fit 
in a single task of last reducer, injecting additional vertex in pipeline may 
make query slower.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 121 (patched)
<https://reviews.apache.org/r/59468/#comment249136>

    Unused field.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 122 (patched)
<https://reviews.apache.org/r/59468/#comment249123>

    Comment : Position of distinct column in aggregator list of map Gby before 
rewrite.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 135 (patched)
<https://reviews.apache.org/r/59468/#comment249134>

    Should this be cntDist > 1 ?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 156-158 (patched)
<https://reviews.apache.org/r/59468/#comment249130>

    This seems redundant since we already checked for cntDist != in loop.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 159 (patched)
<https://reviews.apache.org/r/59468/#comment249133>

    Some extra safety checks:
    1) mGby is in hash mode.
    2) rGby is in mergepartial mode.
    3) RS.getKeys().size() =1 
    4) RS partition column size = 1
    5) RS sort col size = 1.
    6) mGby has no grouping sets.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 197 (patched)
<https://reviews.apache.org/r/59468/#comment249135>

    Comment : distinct is at lost position.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 313 (patched)
<https://reviews.apache.org/r/59468/#comment249137>

    This should be PARTIAL2 mode as well, since GBy operator is running in 
Partial2 mode.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 405 (patched)
<https://reviews.apache.org/r/59468/#comment249122>

    throw new SemanticException(e);



ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java
Line 538 (original), 538-546 (patched)
<https://reviews.apache.org/r/59468/#comment249138>

    This change may not be needed if we run Count distinct optimization after 
this has alreday run.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
Lines 75 (patched)
<https://reviews.apache.org/r/59468/#comment249141>

    Also, lets call this optimizaiton only for Tez.


- Ashutosh Chauhan


On May 22, 2017, 10:31 p.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 22, 2017, 10:31 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 
>   itests/src/test/resources/testconfiguration.properties e23ef6317f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java 
> 3233157d8d 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out 
> PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Reply via email to