-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43115/
-----------------------------------------------------------
(Updated Feb. 4, 2016, 9:29 p.m.)
Review request for hive, Jesús Camacho Rodríguez and John Pullokkaran.
Changes
-------
Thanks John for the review.
The naming convention for the distinct UDAF field of the reduce-side GBY is
<Last Reduce Key>:<Current Distinct UDF#>._col_<Distinct Key # in the current
Distinct UDF>. It seems that we currently don't generate the colExprMap
correctly for this convention in HiveGBOpConvUtil.genMapSideRS(). The
reduce-side GBY pipeline looks good to me in the current return path code. Since
we are not generating entries for the correct columns in the map-side
ReduceSink operator, we run into an exception when we look up the entry
corresponding to a column during the reduce-side aggregation.
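For illustration only, the naming convention can be sketched as a small string-building helper. This is not Hive's code: the class and method names are hypothetical, and the "KEY." prefix and "._col<N>" suffix simply follow the names that appear in explain output (for example KEY._col1:1._col0).

```java
// Hypothetical helper, NOT part of Hive: it only mirrors the naming pattern
// seen in explain output, e.g. KEY._col1:1._col0.
final class DistinctColumnNames {

    private DistinctColumnNames() {
    }

    // Builds <Last Reduce Key>:<Distinct UDF #>._col<Distinct Key #>.
    static String distinctFieldName(String lastReduceKey, int distinctUdfIndex, int distinctKeyIndex) {
        return lastReduceKey + ":" + distinctUdfIndex + "._col" + distinctKeyIndex;
    }

    // Adds the "KEY." prefix under which the reduce side sees the column.
    static String reduceSideKeyName(String lastReduceKey, int distinctUdfIndex, int distinctKeyIndex) {
        return "KEY." + distinctFieldName(lastReduceKey, distinctUdfIndex, distinctKeyIndex);
    }

    public static void main(String[] args) {
        // Second key (index 1) of the second distinct UDAF (index 1),
        // with _col1 as the last reduce key.
        System.out.println(reduceSideKeyName("_col1", 1, 1)); // prints KEY._col1:1._col1
    }
}
```

The colExprMap of the map-side ReduceSink would need one entry per name of this shape for the reduce-side lookup to succeed.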
There is another optimization that could possibly be done in the scenario
below (after turning off map-side aggregation):
explain FROM srcpart src SELECT count(DISTINCT src.value), count(DISTINCT
src.key,src.key), sum(DISTINCT src.value) WHERE src.ds = '2008-04-08' GROUP BY
substr(src.key,1,1);
The Reduce Operator Tree:
.......
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT
KEY._col1:1._col0, KEY._col1:1._col1), sum(DISTINCT KEY._col1:2._col0)
keys: KEY._col0 (type: string)
mode: complete
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
Select Operator
......
As you can see,
1. KEY._col1:1._col0 and KEY._col1:1._col1 are mapped to the same column, and
hence we could have used a single column in the row schema of the ReduceSink
operator pipeline.
2. KEY._col1:2._col0 and KEY._col1:0._col0 are mapped to the same column, and
we can do the same thing as mentioned in 1.
I verified that this happens even in the non-return path code, so it should be
covered as a general change, as a further optimization, in a separate JIRA.
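As a rough sketch of the idea, and not a proposal for the actual implementation, the distinct arguments of all the distinct UDAFs could be mapped to shared positions, so that a repeated expression gets only one column. The class and method names below are hypothetical, and expressions are represented as plain strings purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT Hive code: assigns one shared position to each
// distinct-argument expression, so repeated expressions reuse a column.
final class DistinctKeyDedup {

    private DistinctKeyDedup() {
    }

    // distinctArgs holds one list of argument expressions per distinct UDAF.
    static Map<String, Integer> assignSharedPositions(List<List<String>> distinctArgs) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (List<String> args : distinctArgs) {
            for (String expr : args) {
                // The first occurrence takes the next free position;
                // later occurrences keep the existing one.
                positions.putIfAbsent(expr, positions.size());
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // count(DISTINCT value), count(DISTINCT key, key), sum(DISTINCT value)
        List<List<String>> distinctArgs = Arrays.asList(
            Arrays.asList("src.value"),
            Arrays.asList("src.key", "src.key"),
            Arrays.asList("src.value"));
        // Four distinct arguments collapse into two shared columns.
        System.out.println(assignSharedPositions(distinctArgs)); // prints {src.value=0, src.key=1}
    }
}
```

For the query above, the four distinct arguments would need only two columns under this scheme. Whether the ReduceSink and GBY operators can actually share columns across distinct UDAFs this way is an assumption that would need checking in the separate JIRA.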
Thanks
Hari
Repository: hive-git
Description
-------
CBO: Calcite Operator To Hive Operator (Calcite Return Path): TestCliDriver
groupby_ppr_multi_distinct.q failure
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/HiveGBOpConvUtil.java
7fbf8cd
Diff: https://reviews.apache.org/r/43115/diff/
Testing
-------
Precommit runs
Thanks,
Hari Sankar Sivarama Subramaniyan