[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896974#action_12896974 ]
Ning Zhang commented on HIVE-741:
---------------------------------

The joins are implemented in JoinOperator and CommonJoinOperator for regular reduce-side joins. The map-side joins are implemented in MapJoinOperator.

In reduce-side joins, the join keys are used as the distribution keys from the mappers to the reducers, so each group (delimited by beginGroup() and endGroup()) consists of rows with the same join keys. The reduce-side join caches all rows within a group except those of the last table (the streaming table), whose rows are scanned and combined with the cached rows of the other tables as a Cartesian product. I think the fix would be to check for NULL values in the join keys and produce the proper output based on the semantics of each join type.

The map-side join is basically a hash join: the small table is read in its entirety into a hash table, which is then probed while scanning the streaming table.

There are other types of joins (bucketed map-side join, sort-merge join, etc.), but they all rely on the 3 classes mentioned above.

Let me know if you have further questions as you get started.

> NULL is not handled correctly in join
> -------------------------------------
>
>                 Key: HIVE-741
>                 URL: https://issues.apache.org/jira/browse/HIVE-741
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> Key        Value
> ------     --------
> NULL       325
> 18         NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL    325    18    NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.
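
To make the proposed NULL check concrete, here is a minimal, self-contained sketch of a hash-style inner join that skips NULL join keys on both the build side and the probe side. It is not Hive code: the class NullAwareJoinSketch, the method innerJoin, and the representation of a row as an Integer[] of {key, value} are all hypothetical names chosen for illustration, and the real change would have to live in the operator classes named in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this does not use any Hive classes.
class NullAwareJoinSketch {

    // Inner equi-join on a.key = b.value. Each row is {key, value}.
    static List<Integer[]> innerJoin(List<Integer[]> a, List<Integer[]> b) {
        // Build phase: hash the rows of b on the join column (b.value).
        // HashMap accepts a null key, so without the check below a NULL
        // would be stored and later "match" another NULL.
        Map<Integer, List<Integer[]>> hash = new HashMap<>();
        for (Integer[] row : b) {
            Integer joinKey = row[1];
            if (joinKey == null) {
                continue; // NULL = anything is never true in an inner join
            }
            hash.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(row);
        }

        // Probe phase: stream the rows of a and look up a.key.
        List<Integer[]> out = new ArrayList<>();
        for (Integer[] row : a) {
            Integer joinKey = row[0];
            if (joinKey == null) {
                continue; // a NULL probe key cannot match any row
            }
            List<Integer[]> matches =
                hash.getOrDefault(joinKey, Collections.<Integer[]>emptyList());
            for (Integer[] match : matches) {
                out.add(new Integer[] {row[0], row[1], match[0], match[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The two rows of input4_cb from the issue description.
        List<Integer[]> input4Cb = Arrays.asList(
            new Integer[] {null, 325},
            new Integer[] {18, null});
        // Self-join on a.key = b.value; prints 0 (the empty set).
        System.out.println(innerJoin(input4Cb, input4Cb).size());
    }
}
```

This sketch covers inner-join semantics only. For outer joins, a row with a NULL key on the preserved side would still need to be emitted, padded with NULLs for the other side, rather than dropped.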