----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/59468/#review175801 -----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 61 (patched) <https://reviews.apache.org/r/59468/#comment249127> Comment: Queries of form : select max(c), count(distinct c) from T; generates a plan of form TS->mGBy->RS->rGBy->FS This plan suffers from a problem that vertex containing rGBy->FS necessarily need to have 1 task. This limitation results in slow execution because that task gets all the data. This optimization if successful will rewrite above plan to TS->mGby->RS->mGby2->RS->rGBy->FS This introduces extra vertex of mGby2->RS Note this vertex can have multiple tasks and since we are doing aggregation, output of this must necessarily be smaller than its input, which results in much less data going in to rGby->FS vertex, which continues to have single task. Also note on calcite tree we have HiveExpandDistinctAggregatesRule rule which does similiar plan transformation but has different conditions which needs to be satisified. Additionally, we don't do any costing here but this is possibly that this transformation may slow down query a bit since if data is small enough to fit in a single task of last reducer, injecting additional vertex in pipeline may make query slower. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 121 (patched) <https://reviews.apache.org/r/59468/#comment249136> Unused field. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 122 (patched) <https://reviews.apache.org/r/59468/#comment249123> Comment : Position of distinct column in aggregator list of map Gby before rewrite. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 135 (patched) <https://reviews.apache.org/r/59468/#comment249134> Should this be cntDist > 1 ? ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 156-158 (patched) <https://reviews.apache.org/r/59468/#comment249130> This seems redundant since we already checked for cntDist != in loop. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 159 (patched) <https://reviews.apache.org/r/59468/#comment249133> Some extra safety checks: 1) mGby is in hash mode. 2) rGby is in mergepartial mode. 3) RS.getKeys().size() =1 4) RS partition column size = 1 5) RS sort col size = 1. 6) mGby has no grouping sets. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 197 (patched) <https://reviews.apache.org/r/59468/#comment249135> Comment : distinct is at lost position. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 313 (patched) <https://reviews.apache.org/r/59468/#comment249137> This should be PARTIAL2 mode as well, since GBy operator is running in Partial2 mode. ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java Lines 405 (patched) <https://reviews.apache.org/r/59468/#comment249122> throw new SemanticException(e); ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java Line 538 (original), 538-546 (patched) <https://reviews.apache.org/r/59468/#comment249138> This change may not be needed if we run Count distinct optimization after this has alreday run. ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java Lines 75 (patched) <https://reviews.apache.org/r/59468/#comment249141> Also, lets call this optimizaiton only for Tez. - Ashutosh Chauhan On May 22, 2017, 10:31 p.m., pengcheng xiong wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/59468/ > ----------------------------------------------------------- > > (Updated May 22, 2017, 10:31 p.m.) > > > Review request for hive, Ashutosh Chauhan and Gopal V. > > > Repository: hive-git > > > Description > ------- > > HIVE-16654 > > > Diffs > ----- > > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 > itests/src/test/resources/testconfiguration.properties e23ef6317f > ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java > 3233157d8d > ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f > ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 > ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION > ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION > ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a > ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc > ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out > PRE-CREATION > ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 > ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 > ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 > ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b > ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f > ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b > ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec > ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e > ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 > > > Diff: https://reviews.apache.org/r/59468/diff/1/ > > > Testing > ------- > > > Thanks, > > pengcheng xiong > >