leanken commented on pull request #29104:
URL: https://github.com/apache/spark/pull/29104#issuecomment-659750039


   Hi @agrawaldevesh,
   I am afraid that moving the optimization into BroadcastHashJoinExec is not 
that easy.
   Right now, the plan I get is:
   BroadcastNestedLoopJoinExec(LeftAnti with condition Or(EqualTo(a=b), 
IsNull(EqualTo(a=b))))
   
   To translate this into BroadcastHashJoinExec, I first need a join key, 
right?
   BroadcastHashJoinExec(LeftAnti joinKey(a=b), with condition)
   But extracting EquiJoinKeys already breaks the integrity of the original 
condition Or(EqualTo(a=b), IsNull(EqualTo(a=b))).
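
   To make that condition concrete, here is a minimal sketch of how 
Or(EqualTo(a=b), IsNull(EqualTo(a=b))) evaluates under SQL three-valued logic. 
It is plain Scala rather than Spark code; the names NotInCondition, sqlEq and 
antiCondition are hypothetical, and a SQL NULL is modelled as None.

   ```scala
   // Plain-Scala model of SQL three-valued logic; not Spark's implementation.
   object NotInCondition {
     // SQL equality: a NULL on either side yields NULL (None here).
     def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
       for (x <- a; y <- b) yield x == y

     // Or(EqualTo(a, b), IsNull(EqualTo(a, b))): true when the keys are equal
     // or when the comparison itself is NULL. IsNull never returns NULL, so
     // the overall result is always a definite Boolean.
     def antiCondition(a: Option[Int], b: Option[Int]): Boolean = {
       val eq = sqlEq(a, b)
       eq.contains(true) || eq.isEmpty
     }
   }
   ```

   The point of the sketch is that the null branch is part of the match 
condition itself, so pulling a=b out as an equi-join key loses the rows that 
should match only because the comparison is NULL.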
   
   Let's see what codegenAnti looks like:
   
   ```
   s"""
            |boolean $found = false;
            |// generate join key for stream side
            |${keyEv.code}
            |// Check if the key has nulls.
            |if (!($anyNull)) {
            |  // Check if the HashedRelation exists.
            |  UnsafeRow $matched = (UnsafeRow)$relationTerm.getValue(${keyEv.value});
            |  if ($matched != null) {
            |    // Evaluate the condition.
            |    $checkCondition {
            |      $found = true;
            |    }
            |  }
            |}
            |if (!$found) {
            |  $numOutput.add(1);
            |  ${consume(ctx, input)}
            |}
          """.stripMargin
   ```
   
   A LeftAnti join on a key keeps the streamed-side row when its key is null, 
but NotInSubquery needs exactly the opposite. I could certainly add some 
if-else checks here, but that might clutter the whole BroadcastHashJoinExec 
code.
   
   Besides the difference for null keys on the streamed side, I would also 
need to scan the entire build side to see whether a null key exists, which is 
also kind of odd.
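
   The two null-handling differences above can be sketched on plain Scala 
collections. This is only an illustration of the semantics, not Spark code: 
AntiJoinSemantics, leftAnti and notIn are hypothetical names, keys are modelled 
as Option[Int] with None standing for NULL, and the extra join condition is 
ignored. The empty-build-side case is not discussed above; it is included 
because standard SQL NOT IN semantics pass every row when the subquery is 
empty.

   ```scala
   // Plain-Scala model of the two semantics; not Spark's implementation.
   object AntiJoinSemantics {
     // Keyed LeftAnti: a NULL streamed key never matches, so the row is kept.
     def leftAnti(stream: Seq[Option[Int]],
                  build: Seq[Option[Int]]): Seq[Option[Int]] = {
       val keys = build.flatten.toSet
       stream.filter {
         case None    => true
         case Some(k) => !keys.contains(k)
       }
     }

     // Single-column NOT IN semantics.
     def notIn(stream: Seq[Option[Int]],
               build: Seq[Option[Int]]): Seq[Option[Int]] =
       if (build.isEmpty) stream                  // empty subquery: keep all rows
       else if (build.contains(None)) Seq.empty   // any NULL build key: keep none
       else {
         val keys = build.flatten.toSet
         stream.filter {
           case None    => false                  // NULL streamed key: row dropped
           case Some(k) => !keys.contains(k)
         }
       }
   }
   ```

   With a streamed side of Seq(Some(1), Some(2), None) and a build side of 
Seq(Some(1)), leftAnti keeps Some(2) and None while notIn keeps only Some(2). 
Adding a None to the build side makes notIn return nothing at all, which is 
why the whole build side has to be inspected.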
   
   BroadcastHashJoinExec assumes that it has a join key, but with my 
NotInSubquery check applied it would be like: hey, I found two keys that 
should be joined, but wait a minute, there is a tiny corner case here, so back 
off.
   
   If it were up to me, I would not break the integrity of 
BroadcastHashJoinExec; I would rather treat NotInSubquerySingleColumn as a 
runtime optimization.
   
   So I am pulling together the relevant information for you all and seeking 
advice before I move on to the next step.
   
   Option A.
   Treat NotInSubquerySingleColumn as a runtime optimization.
   
   Option B.
   Move the code into BroadcastHashJoinExec, although the codegen looks tricky.
   
   Looking forward to your reply. Many thanks.
   
   




