[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365 ]
Amareshwari Sriramadasu commented on HIVE-741: ---------------------------------------------- Thanks Ning for your comments. bq. select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value; bq. select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value; This is happening because I'm assuming nr.get(0) in JoinOperator is the join-key. It seems it not always true that key is the first element in the ArrayList. When I modified a the code to the following, above queries are giving correct results. {code} StructObjectInspector soi = (StructObjectInspector) inputObjInspectors[tag]; StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY .toString()); Object keyObject = soi.getStructFieldData(row, sf); if (SerDeUtils.isNullObject(keyObject, soi)) { endGroup(); startGroup(); } {code} Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object passed is representing a NULL object. bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join myinput1 c ON a.value = b.value and b.value = c.value; Looking at Stage-1 of "explain" for the above query: {noformat} Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: int expr: value type: int b TableScan alias: b Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: int expr: value type: int Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4, _col5 Filter Operator predicate: expr: (_col1 = _col5) type: boolean File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat {noformat} Join happens without join key? Here, join output is the Cartesian product of a and b on which FilterOperator is applied, Am I right? I see the semantics of inner/outer join on two tables without join condition is to produce Cartesian product. As a side note: "MySql does not allow outer joins without join condition". If Join is allowed without join condition to produce Cartesian product of the two tables, then my patch should be changed to consider if join-key is defined for the join or not. I could reproduce it by simple query "select * FROM myinput1 a JOIN myinput1 b". I think the same applies to MapJoin as well. bq. Verified that SMBMapJoinOperator already filters nulls properly. bq. Can you also add one or few tests for sort merge join? It seems my verification was wrong here, I thought if the table is sorted and hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge are set to true, MapJoin uses SMBMapJoinOperator. But it was using MapJoinOperator it self. When I created a table with "sorted by" column, I see it using SMBMapJoinOperator. Currently if there are any nulls in the input table, SMBJoin fails with NullPointerException: {noformat} Caused by: java.lang.NullPointerException at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60) at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479) ... 17 more {noformat} Will look into this. bq. For inner, left and right outer joins, a simpler fix would be to add a filter on top. Now, I agree it would be simpler :). Will consider this also and see if i can do some special handling for full outer joins. > NULL is not handled correctly in join > ------------------------------------- > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug > Reporter: Ning Zhang > Assignee: Amareshwari Sriramadasu > Attachments: patch-741.txt > > > With the following data in table input4_cb: > Key Value > ------ -------- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL 325 18 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.