[
https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780463#comment-13780463
]
Hudson commented on HIVE-5357:
------------------------------
FAILURE: Integrated in Hive-trunk-h0.21 #2363 (See
[https://builds.apache.org/job/Hive-trunk-h0.21/2363/])
HIVE-5357 : ReduceSinkDeDuplication optimizer pick the wrong keys in
pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY (Chun
Chen via Ashutosh Chauhan) (hashutosh:
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1526990)
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* /hive/trunk/ql/src/test/queries/clientpositive/reduce_deduplicate_extended.q
* /hive/trunk/ql/src/test/results/clientpositive/reduce_deduplicate_extended.q.out
> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr
> scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5357
> URL: https://issues.apache.org/jira/browse/HIVE-5357
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.11.0
> Reporter: Chun Chen
> Assignee: Chun Chen
> Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: HIVE-5357.patch
>
>
> Example:
> {code}
> select key, count(distinct value) from (select key, value from src group by
> key, value) t group by key;
> //result
> 0 0 NULL
> 10 10 NULL
> 100 100 NULL
> 103 103 NULL
> 104 104 NULL
> {code}
> The result is wrong: the query selects two columns (key and the distinct
> count), but every output row has three.
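For reference, the intended semantics of the example query can be modeled outside Hive. The sketch below is plain Python, not Hive code; the sample rows and the function name `count_distinct_per_key` are assumptions made for illustration only. It shows that the query should yield one (key, count) pair per key, i.e. two columns.

```python
from collections import defaultdict


def count_distinct_per_key(rows):
    """Model of:
    select key, count(distinct value)
    from (select key, value from src group by key, value) t
    group by key
    """
    # Inner query: GROUP BY key, value collapses duplicate (key, value) pairs.
    deduped = set(rows)

    # Outer query: GROUP BY key, counting distinct non-NULL values per key.
    values_by_key = defaultdict(set)
    for key, value in deduped:
        values_by_key[key]  # make sure the key appears even if value is NULL
        if value is not None:
            values_by_key[key].add(value)

    return sorted((key, len(values)) for key, values in values_by_key.items())


# Hypothetical rows, not taken from the real src table.
sample = [
    ("0", "val_0"),
    ("0", "val_0"),
    ("10", "val_10"),
    ("100", "val_100"),
    ("100", "val_100b"),
]

for row in count_distinct_per_key(sample):
    print(row)
```

Every row produced by this model has exactly two fields, which is the shape the Hive query is expected to return.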
> Consider a simple group-by query with a distinct column:
> {code}
> explain select count(distinct value) from src group by key;
> {code}
> Its plan is:
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         src
>           TableScan
>             alias: src
>             Select Operator
>               expressions:
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>               outputColumnNames: key, value
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: string
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: string
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col1:0._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The map-side GBY also adds the distinct columns (value in this case) to its
> key columns.
> When RSDedup optimizes a query involving a GBY with distinct keys and
> map-side aggregation is enabled, it currently assigns the map-side GBY's key
> columns to the reduce-side GBY. So, for the example shown at the beginning,
> after a plan with a single MR job is generated, the second GBY on the
> reduce side uses both key and value as its key columns. The correct key
> column is key alone.
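The key-selection problem described above can be sketched as follows. This is an illustrative Python model, not the code in CorrelationUtilities.java: the function names and parameters are hypothetical, and it assumes the map-side key list holds the grouping columns first, followed by any distinct columns not already present (which matches the plan above, where the map-side keys are key, value).

```python
def map_side_keys(group_by, distinct):
    """Keys of the map-side GBY: grouping columns, then the distinct columns."""
    keys = list(group_by)
    keys += [col for col in distinct if col not in keys]
    return keys


def reduce_side_keys_buggy(group_by, distinct):
    """Behavior described in the report: reuse every map-side key."""
    return map_side_keys(group_by, distinct)


def reduce_side_keys_fixed(group_by, distinct):
    """Intended behavior: keep only the grouping columns, drop the distinct ones."""
    return map_side_keys(group_by, distinct)[: len(group_by)]


# Outer GBY of the example: group by key, count(distinct value).
print(reduce_side_keys_buggy(["key"], ["value"]))
print(reduce_side_keys_fixed(["key"], ["value"]))
```

Slicing by the number of grouping columns, rather than filtering out the distinct columns by name, keeps the result correct in this model even when a column is both a grouping column and a distinct column.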