agrawaldevesh commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r459223391
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -454,6 +478,28 @@ case class BroadcastHashJoinExec(
val (matched, checkCondition, _) = getJoinCondition(ctx, input)
val numOutput = metricTerm(ctx, "numOutputRows")
+ if (isNullAwareAntiJoin) {
+ require(leftKeys.length == 1, "leftKeys length should be 1")
+ require(rightKeys.length == 1, "rightKeys length should be 1")
+ require(joinType == LeftAnti, "joinType must be LeftAnti.")
+ require(buildSide == BuildRight, "buildSide must be BuildRight.")
+ require(SQLConf.get.nullAwareAntiJoinOptimizeEnabled,
+ "nullAwareAntiJoinOptimizeEnabled must be on for null aware anti join optimize.")
+ require(checkCondition == "", "null aware anti join optimize condition should be empty.")
+
+ if (broadcastRelation.value.inputEmpty) {
+ return s"""
+ |// singleColumn NAAJ inputEmpty(true) accept all
+ |$numOutput.add(1);
+ |${consume(ctx, input)}
+ """.stripMargin
+ } else if (broadcastRelation.value.anyNullKeyExists) {
+ return s"""
+ |// singleColumn NAAJ inputEmpty(false) anyNullKeyExists(true) reject all
Review comment:
> @maryannxue @agrawaldevesh @maropu @viirya what do you think?
Thanks for the empirical analysis @cloud-fan ... that always helps to seal
an argument :-)
I do agree that introducing a new node is sometimes more pain than it is
worth, so I would be equally okay with just modifying BHJ. I think both are
fine choices, and I would leave it to @leanken to make the call. The ideal
scenario would be a minimal code diff in BHJ, with a new node introduced only
as a fallback if the diff cannot be reduced to our satisfaction.
Either way, we ought to decide one way or the other soon since the PR is
becoming slightly confusing with both approaches co-existing :-)
Are there any performance benchmarks that we should re-run for BHJ to
ensure there is no regression?
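
For anyone skimming the quoted hunk, here is a minimal sketch of the
single-column NAAJ semantics that the two visible branches seem to encode
(empty build side accepts every row; a null key on the build side rejects
every row). `NaajSketch` and `nullAwareAntiJoin` are hypothetical names used
only for illustration, keys are modeled as `Option[Int]`, and the final
branch is my assumption based on standard SQL `NOT IN` behavior, since the
hunk is cut off before that case. It is not the code in this PR.

```scala
// Hypothetical, simplified model of a single-column null-aware anti join,
// i.e. SELECT * FROM l WHERE l.k NOT IN (SELECT r.k FROM r).
// None stands for a SQL NULL key.
object NaajSketch {
  // Returns the stream-side keys that survive the anti join.
  def nullAwareAntiJoin(
      streamKeys: Seq[Option[Int]],
      buildKeys: Seq[Option[Int]]): Seq[Option[Int]] = {
    val inputEmpty = buildKeys.isEmpty
    val anyNullKeyExists = buildKeys.exists(_.isEmpty)

    if (inputEmpty) {
      // NOT IN over an empty set is true for every row: accept all.
      streamKeys
    } else if (anyNullKeyExists) {
      // A NULL on the build side makes NOT IN never evaluate to true: reject all.
      Seq.empty
    } else {
      // Assumed from SQL NOT IN semantics (this case is not visible in the hunk):
      // a NULL stream key compares as unknown and is dropped; a non-null key
      // is kept only if it has no match on the build side.
      val buildSet = buildKeys.flatten.toSet
      streamKeys.filter {
        case None => false
        case Some(k) => !buildSet.contains(k)
      }
    }
  }
}
```

The first two branches never look at individual stream keys, which is why
the generated code in the hunk can return early without probing the hashed
relation.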