leanken commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r459050176



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -454,6 +478,28 @@ case class BroadcastHashJoinExec(
     val (matched, checkCondition, _) = getJoinCondition(ctx, input)
     val numOutput = metricTerm(ctx, "numOutputRows")
 
+    if (isNullAwareAntiJoin) {
+      require(leftKeys.length == 1, "leftKeys length should be 1")
+      require(rightKeys.length == 1, "rightKeys length should be 1")
+      require(joinType == LeftAnti, "joinType must be LeftAnti.")
+      require(buildSide == BuildRight, "buildSide must be BuildRight.")
+      require(SQLConf.get.nullAwareAntiJoinOptimizeEnabled,
+        "nullAwareAntiJoinOptimizeEnabled must be on for null aware anti join 
optimize.")
+      require(checkCondition == "", "null aware anti join optimize condition 
should be empty.")
+
+      if (broadcastRelation.value.inputEmpty) {
+        return s"""
+           |// singleColumn NAAJ inputEmpty(true) accept all

Review comment:
       > BTW, if `numKeys == 0`, it means there is no null key? because there 
is no record at all. I'm wondering why you define `HashRelation.isEmpty` as 
`numKeys == 0 && ! anyNullKeyExists`
   
   if you go back to see the first commit of the PR, of using HashSet to store 
buildSide data, maybe you would understand why.
   
   HashedRelation is built for normal join, basically filter out null key row 
is suitable for normal Join. But it's totally different story for null aware 
join, because whether the buildSide(Or should i say rightSide) isEmpty or 
nullKeyExist matters a lot.
   
   so i ask you the history of hashedRelation about filter out null key rows.
   
   i think the HashedRelation code change being so weird, and confused it is 
because HashedRelation is built for normal join. if there are A 
NullAwareHashedRelation which can store nullKey rows, it would be more clear.
   
   I was going to implement such NullAwareHashedRelation when i try to support 
multi-column, that's why i want to convince your putting NAAJ into NewExecNode, 
because I think it's ideal that Normal Join deal with HashedRelation, and 
NullAwareJoin to deal with NullAwareHashedRelation separately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to