[
https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022336#comment-18022336
]
Tony Long commented on HIVE-28480:
----------------------------------
Hi [~himanshum], I see this issue was fixed only in 4.1.0 and 4.0.1. Could you
please confirm whether it also affects Hive 3.1.3, or whether it only affects
Hive 4.0.0?
Thanks a lot.
> Disable SMB on partition hash generator mismatch across join branches in
> previous RS
> ------------------------------------------------------------------------------------
>
> Key: HIVE-28480
> URL: https://issues.apache.org/jira/browse/HIVE-28480
> Project: Hive
> Issue Type: Bug
> Components: Query Planning
> Reporter: Himanshu Mishra
> Assignee: Himanshu Mishra
> Priority: Critical
> Labels: hive-4.0.1-merged, hive-4.0.1-must,
> pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>
> As SMB replaces the last RS op from the joining branches and the JOIN op with
> MERGEJOIN, we need to ensure that the RS ops preceding these RS ops, in both
> branches, partition using the same hash generator.
> The hash code generator differs based on ReducerTraits.UNIFORM, i.e.
> [ReduceSinkOperator#computeMurmurHash() or
> ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
> leading to different hash codes for the same value.
> Skip the SMB join in such cases.
> h3. Replication:
> Consider the following query, where the join would get converted to SMB. Auto
> reducer parallelism is enabled, which ensures more than one reducer task.
>
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
> SELECT k, COUNT(DISTINCT v), SUM(v)
> FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
> SELECT k, COUNT(v)
> FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>
>
> Expected result is:
>
> {code:java}
> a 1 20 a 2 {code}
> but on the master branch, it results in:
>
>
> {code:java}
> a 1 20 NULL NULL {code}
>
>
> Here, for COUNT(DISTINCT), the RS key is (k, v) while the partition key is
> still k. In such a scenario the [reducer trait UNIFORM is not
> set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
> The hash code for "a" from the 2nd subquery is generated using murmurHash
> (270516725) while the one from the 1st is generated using bucketHash
> (1086686554), which results in rows with key "a" reaching different reducer
> tasks.
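
To make the mismatch concrete, here is a minimal sketch of how the two hash codes quoted above could land on different reducers. It assumes the conventional Hadoop-style rule of clearing the sign bit and taking the hash modulo the reducer count. The issue text does not confirm that this is the exact routing used here, and the class name, method name, and reducer count of 2 are hypothetical, chosen only for illustration.

```java
// Illustrative sketch only: not Hive code. The partitioning rule below is the
// conventional Hadoop-style formula and is an assumption for this example.
class PartitionMismatchSketch {

    // Map a hash code to a reducer index: clear the sign bit, then take modulo.
    static int partitionFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Hash codes for key "a" as reported in the issue description.
        int murmurHashOfA = 270516725;   // branch where UNIFORM is set
        int bucketHashOfA = 1086686554;  // COUNT(DISTINCT) branch, UNIFORM not set

        int numReducers = 2; // assumed reducer count, anything > 1 can expose the problem

        System.out.println("murmurHash -> reducer " + partitionFor(murmurHashOfA, numReducers));
        System.out.println("bucketHash -> reducer " + partitionFor(bucketHashOfA, numReducers));
    }
}
```

Under these assumptions the same key "a" is sent to reducer 1 by one branch and reducer 0 by the other, so a merge join running per reducer never sees both sides together and the LEFT JOIN emits NULLs for the right side.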
--
This message was sent by Atlassian Jira
(v8.20.10#820010)