[ https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696736#comment-13696736 ]
Hudson commented on HIVE-4781:
------------------------------

Integrated in Hive-trunk-hadoop2 #266 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/266/])
    HIVE-4781 : LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval (Yin Huai via Ashutosh Chauhan) (Revision 1498150)

     Result = FAILURE
hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1498150
Files :
* /hive/trunk/build-common.xml
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/test/queries/clientpositive/leftsemijoin_mr.q
* /hive/trunk/ql/src/test/results/clientpositive/leftsemijoin_mr.q.out


> LEFT SEMI JOIN generates wrong results when the number of rows belonging to a
> single key of the right table exceed hive.join.emit.interval
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-4781
>                 URL: https://issues.apache.org/jira/browse/HIVE-4781
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4781.txt, wrong_semi_join.txt
>
>
> Suppose that we have the query shown below
> {code:sql}
> SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
> {code}
> When the number of rows in t2 for a single key is larger than
> hive.join.emit.interval, JoinOperator emits the matching rows from t1 more
> than once, which results in redundant output. Let's say t1 is
> {code}
> 1
> {code}
> and t2 is
> {code}
> 1
> 1
> 1
> 1
> {code}
> When hive.join.emit.interval=1, the output of the above query will be
> {code}
> 1
> 1
> 1
> 1
> {code}
> The correct result should be
> {code}
> 1
> {code}
> This problem cannot be found by the unit tests: because a GBY operator is
> inserted before JoinOperator and we have only 1 mapper, the output of the
> map phase has only distinct keys.
> Please apply the patch 'wrong_semi_join.txt' attached below and use
> {code}
> ant test -Dtestcase=TestMinimrCliDriver -Dqfile="left_semi_join.q" -Dtest.silent=false
> {code}
> to replay the problem. The wrong result can be found in
> {code}
> <hive_root_dir>/build/ql/test/logs/clientpositive
> {code}
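
For readers who want to see why an early flush produces duplicates in a semi join, here is a toy model of the behaviour described in the issue. It is only a sketch under stated assumptions, not Hive's JoinOperator code: the function name left_semi_join_reduce, the emit_interval parameter (a stand-in for hive.join.emit.interval) and the flush_early switch are all hypothetical, and the model handles the rows of a single join key only.

```python
def left_semi_join_reduce(left_rows, right_rows, emit_interval, flush_early=True):
    """Toy model of a reduce-side LEFT SEMI JOIN for ONE join key.

    left_rows / right_rows: rows of t1 / t2 that share the key.
    emit_interval: stand-in for hive.join.emit.interval (hypothetical name).
    flush_early: if True, results are emitted every time the right-side
                 buffer reaches emit_interval rows (assumed buggy behaviour).
    """
    output = []
    buffered = 0  # right-side rows seen since the last flush
    for _ in right_rows:
        buffered += 1
        if flush_early and buffered == emit_interval:
            # Each flush sees "at least one match" and emits the left rows again.
            output.extend(left_rows)
            buffered = 0
    if buffered > 0:
        # Final flush at the end of the key group.
        output.extend(left_rows)
    return output


t1 = [1]
t2 = [1, 1, 1, 1]

# Interval of 1: every right-side row triggers a flush, so the left row repeats.
print(left_semi_join_reduce(t1, t2, emit_interval=1))                     # [1, 1, 1, 1]
# No early flush for the semi join: the left row is emitted once.
print(left_semi_join_reduce(t1, t2, emit_interval=1, flush_early=False))  # [1]
```

With the t1 and t2 from the description and an interval of 1, the model returns four rows, matching the wrong output reported above. Disabling the early flush yields the single expected row. Whether the committed patch takes exactly this approach is not stated in this message.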