[GitHub] [spark] agrawaldevesh commented on a change in pull request #29104: [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize

GitBox Wed, 22 Jul 2020 22:37:34 -0700


agrawaldevesh commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r459223391




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -454,6 +478,28 @@ case class BroadcastHashJoinExec(
     val (matched, checkCondition, _) = getJoinCondition(ctx, input)
     val numOutput = metricTerm(ctx, "numOutputRows")
 
+    if (isNullAwareAntiJoin) {
+      require(leftKeys.length == 1, "leftKeys length should be 1")
+      require(rightKeys.length == 1, "rightKeys length should be 1")
+      require(joinType == LeftAnti, "joinType must be LeftAnti.")
+      require(buildSide == BuildRight, "buildSide must be BuildRight.")
+      require(SQLConf.get.nullAwareAntiJoinOptimizeEnabled,
+        "nullAwareAntiJoinOptimizeEnabled must be on for null aware anti join optimize.")
+      require(checkCondition == "", "null aware anti join optimize condition should be empty.")
+
+      if (broadcastRelation.value.inputEmpty) {
+        return s"""
+           |// singleColumn NAAJ inputEmpty(true) accept all
+           |$numOutput.add(1);
+           |${consume(ctx, input)}
+         """.stripMargin
+      } else if (broadcastRelation.value.anyNullKeyExists) {
+        return s"""
+           |// singleColumn NAAJ inputEmpty(false) anyNullKeyExists(true) reject all

Review comment:
       > @maryannxue @agrawaldevesh @maropu @viirya what do you think?
   
   Thanks for the empirical analysis @cloud-fan ... that always helps to seal 
an argument :-) 
   
   I do agree that introducing a new node is sometimes more pain than it is 
worth, so I would be equally okay with just modifying BHJ. I think they are 
both fine choices and I would leave it to @leanken to make the call. The ideal 
scenario would be to have a minimal code diff in BHJ, with a new node 
introduced only as a fallback if the diff cannot be reduced to our satisfaction.
   
   Either way, we ought to decide one way or the other soon since the PR is 
becoming slightly confusing with both approaches co-existing :-)
   
   Are there any performance benchmarks that we should be re-running for BHJ to 
ensure there is no regression?
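
For readers following the quoted hunk, the two short-circuit branches (empty build side accepts every row; any NULL key on the build side rejects every row) match SQL `NOT IN` semantics for a single key column. The sketch below models that behaviour on plain collections. It is only an illustration, not the PR's implementation: `NaajSketch` and `nullAwareAntiJoin` are hypothetical names, `None` stands in for a SQL NULL key, and the third branch is an assumption based on standard `NOT IN` semantics rather than on code shown in the hunk.

```scala
// Hypothetical model of a single-column null-aware anti join (NOT IN).
// Not Spark's API; None stands in for a SQL NULL key.
object NaajSketch {
  def nullAwareAntiJoin[K](probe: Seq[Option[K]], build: Seq[Option[K]]): Seq[Option[K]] = {
    if (build.isEmpty) {
      // Corresponds to inputEmpty(true): NOT IN over an empty set is true, accept all.
      probe
    } else if (build.contains(None)) {
      // Corresponds to anyNullKeyExists(true): the predicate is never true, reject all.
      Seq.empty
    } else {
      // Assumed remaining case: drop NULL probe keys and keys found on the build side.
      val keys = build.flatten.toSet
      probe.filter {
        case None    => false
        case Some(k) => !keys.contains(k)
      }
    }
  }
}
```

In this model the first two branches never inspect individual probe keys, which is presumably why the generated code in the hunk can return early in those cases.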




