-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43115/
-----------------------------------------------------------

(Updated Feb. 4, 2016, 9:29 p.m.)


Review request for hive, Jesús Camacho Rodríguez and John Pullokkaran.


Changes
-------

Thanks John for the review.

The naming convention for a Distinct UDAF field in the reduce-side GBY is:
<Last Reduce Key>:<Current Distinct UDF#>._col_<Distinct Key # in the current
Distinct UDF>. Currently we do not seem to generate the colExprMap correctly
for this convention in HiveGBOpConvUtil.genMapSideRS(). The reduce-side GBY
pipeline looks good to me in the current return path code. Because we do not
generate entries for the correct columns in the map-side ReduceSink Operator,
we hit an exception when the reduce-side aggregation looks up the entry for
one of those columns.
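
To make the convention concrete, here is a minimal sketch that rebuilds the 
reduce-side names in the form they take in explain output (for example 
KEY._col1:0._col0). DistinctColumnNames and its methods are hypothetical 
names used only for illustration; they are not part of Hive or of this patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hive code: builds the reduce-side column names
// used for the arguments of distinct UDAFs.
final class DistinctColumnNames {

  // lastReduceKey: name of the last reduce key, e.g. "KEY._col1"
  // udafIndex:     position of the distinct UDAF among the distinct UDAFs
  // argIndex:      position of the argument inside that UDAF
  static String name(String lastReduceKey, int udafIndex, int argIndex) {
    return lastReduceKey + ":" + udafIndex + "._col" + argIndex;
  }

  // One name per argument of every distinct UDAF, in declaration order.
  // argCounts[i] is the number of arguments of the i-th distinct UDAF.
  static List<String> allNames(String lastReduceKey, int[] argCounts) {
    List<String> names = new ArrayList<>();
    for (int udaf = 0; udaf < argCounts.length; udaf++) {
      for (int arg = 0; arg < argCounts[udaf]; arg++) {
        names.add(name(lastReduceKey, udaf, arg));
      }
    }
    return names;
  }

  public static void main(String[] args) {
    // count(DISTINCT value), count(DISTINCT key, key), sum(DISTINCT value)
    System.out.println(allNames("KEY._col1", new int[] {1, 2, 1}));
    // prints [KEY._col1:0._col0, KEY._col1:1._col0, KEY._col1:1._col1, KEY._col1:2._col0]
  }
}
```

Every name produced this way needs a matching colExprMap entry on the map 
side; a missing entry is what leads to the failed lookup described above.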

There is another optimization that could be done in the scenario below (after 
turning off map-side aggregation):
explain FROM srcpart src SELECT count(DISTINCT src.value), count(DISTINCT 
src.key,src.key), sum(DISTINCT src.value) WHERE src.ds = '2008-04-08' GROUP BY 
substr(src.key,1,1);

The Reduce Operator Tree :
.......
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0, KEY._col1:1._col1), sum(DISTINCT KEY._col1:2._col0)
          keys: KEY._col0 (type: string)
          mode: complete
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
          ......
As you can see:
1. KEY._col1:1._col0 and KEY._col1:1._col1 map to the same column (src.key), so 
we could have used a single column in the row schema of the ReduceSink Operator 
pipeline.
2. KEY._col1:2._col0 and KEY._col1:0._col0 map to the same column (src.value), 
so the same applies as in 1.
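
The sharing described in points 1 and 2 could look roughly like the sketch 
below: each distinct-argument name is mapped to the first name generated for 
the same source expression. DistinctColumnSharing and canonicalNames are 
hypothetical names, and the source expressions are plain strings here; this is 
an illustration of the idea, not how Hive represents expressions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hive code: finds distinct-UDAF arguments that
// read the same source expression and could share one ReduceSink column.
final class DistinctColumnSharing {

  // distinctArgs.get(i) lists the source expressions of the i-th distinct UDAF.
  // Returns: generated name -> first generated name for the same expression.
  static Map<String, String> canonicalNames(String lastReduceKey,
                                            List<List<String>> distinctArgs) {
    Map<String, String> firstNameForExpr = new LinkedHashMap<>();
    Map<String, String> canonical = new LinkedHashMap<>();
    for (int udaf = 0; udaf < distinctArgs.size(); udaf++) {
      List<String> args = distinctArgs.get(udaf);
      for (int arg = 0; arg < args.size(); arg++) {
        String name = lastReduceKey + ":" + udaf + "._col" + arg;
        // putIfAbsent returns the earlier name if this expression was seen.
        String earlier = firstNameForExpr.putIfAbsent(args.get(arg), name);
        canonical.put(name, earlier == null ? name : earlier);
      }
    }
    return canonical;
  }

  public static void main(String[] args) {
    // count(DISTINCT value), count(DISTINCT key, key), sum(DISTINCT value)
    List<List<String>> distinctArgs = Arrays.asList(
        Arrays.asList("src.value"),
        Arrays.asList("src.key", "src.key"),
        Arrays.asList("src.value"));
    canonicalNames("KEY._col1", distinctArgs)
        .forEach((name, shared) -> System.out.println(name + " -> " + shared));
  }
}
```

For the query above, KEY._col1:1._col1 resolves to KEY._col1:1._col0 and 
KEY._col1:2._col0 resolves to KEY._col1:0._col0, so only two distinct columns 
would be needed instead of four.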

I verified that this also happens in the non-return path code, so it should be 
handled as a general optimization in a separate JIRA.

Thanks
Hari


Repository: hive-git


Description
-------

CBO: Calcite Operator To Hive Operator (Calcite Return Path): TestCliDriver 
groupby_ppr_multi_distinct.q failure


Diffs (updated)
-----

  
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/HiveGBOpConvUtil.java 7fbf8cd 

Diff: https://reviews.apache.org/r/43115/diff/


Testing
-------

Precommit runs


Thanks,

Hari Sankar Sivarama Subramaniyan
