> On May 23, 2017, 5 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
> > Lines 61 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727326#file1727326line61>
> >
> > Comment: Queries of the form: select max(c), count(distinct c) from T;
> > generate a plan of the form TS->mGBy->RS->rGBy->FS.
> > The problem with this plan is that the vertex containing rGBy->FS
> > necessarily has exactly 1 task. This limitation results in slow execution
> > because that task receives all the data.
> > This optimization, if successful, rewrites the above plan to
> > TS->mGby->RS->mGby2->RS->rGBy->FS. This introduces an extra vertex, mGby2->RS.
> > Note that this vertex can have multiple tasks, and since it performs
> > aggregation, its output must necessarily be smaller than its input,
> > which results in much less data going into the rGby->FS vertex, which
> > continues to have a single task.
> > Also note that on the Calcite tree we have the HiveExpandDistinctAggregatesRule
> > rule, which does a similar plan transformation but has different conditions
> > that need to be satisfied.
> > Additionally, we don't do any costing here, but it is possible that
> > this transformation may slow a query down a bit: if the data is small enough
> > to fit in a single task of the last reducer, injecting an additional vertex
> > into the pipeline may make the query slower.

Thanks for the detailed comments.


> On May 23, 2017, 5 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
> > Lines 313 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727326#file1727326line313>
> >
> > This should be PARTIAL2 mode as well, since GBy operator is running in
> > Partial2 mode.

PARTIAL2 expects an integer as input. However, here we are counting key_col0, which is a string. Thus, hash mode is more appropriate.


- pengcheng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175801
-----------------------------------------------------------


On May 25, 2017, 4:03 a.m., pengcheng xiong wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
>
> (Updated May 25, 2017, 4:03 a.m.)
>
>
> Review request for hive, Ashutosh Chauhan and Gopal V.
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-16654
>
>
> Diffs
> -----
>
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6
>
>
> Diff: https://reviews.apache.org/r/59468/diff/2/
>
>
> Testing
> -------
>
>
> Thanks,
>
> pengcheng xiong
>
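
As a footnote to the first review comment in this thread (the count(distinct c) plan rewrite), here is a rough sketch of why an extra intermediate stage can shrink the input of the single final task. It is a toy simulation of the idea only and does not use Hive's operators: the class CountDistinctSketch and the methods singleStage and twoStage are hypothetical names, the rows are assumed to be non-null strings, and max(c) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy simulation only: class and method names are hypothetical, not Hive APIs.
final class CountDistinctSketch {

    // Original plan shape: the single final task sees every row and
    // deduplicates all of them on its own.
    static long singleStage(List<String> rows) {
        return new HashSet<>(rows).size();
    }

    // Rewritten plan shape: rows are hash-partitioned on the distinct column
    // across `tasks` intermediate tasks, each task deduplicates its own
    // partition, and the final single task only adds up one count per task.
    // Assumes rows contains no nulls and tasks >= 1.
    static long twoStage(List<String> rows, int tasks) {
        List<Set<String>> partitions = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            partitions.add(new HashSet<>());
        }
        for (String row : rows) {
            // floorMod keeps the partition index non-negative for negative hash codes.
            partitions.get(Math.floorMod(row.hashCode(), tasks)).add(row);
        }
        long total = 0;
        for (Set<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(singleStage(rows) + " " + twoStage(rows, 4));
    }
}
```

Because equal values always hash to the same partition, the per-partition distinct counts can simply be summed, so the final task handles one number per intermediate task instead of every row. The costing caveat from the comment still applies: for small inputs the extra stage is pure overhead.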