[GitHub] [spark] leanken commented on a change in pull request #29104: [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize

GitBox Tue, 21 Jul 2020 20:00:24 -0700


leanken commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r458489011




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##########
@@ -71,6 +71,18 @@ private[execution] sealed trait HashedRelation extends KnownSizeEstimation {
    */
   def keyIsUnique: Boolean
 
+  /**
+   * Note that, the hashed relation can be empty despite the Iterator[InternalRow] being not empty,
+   * since the hashed relation skips over null keys.
+   */

Review comment:
       While building a HashedRelation, we can end up in any of the following cases:
   
   1. The input Iterator[InternalRow] itself is empty, so inputEmpty = true.
   2. The iterator contains a row that has a null value in any column, so anyNullKeyExists = true, but that row is filtered out and is not present in the HashedRelation.
   3. A normal row with no null values is kept.
   
   inputEmpty != !anyNullKeyExists,
   because even when no null-key row exists, normal rows can still be present.
   
   This should be right, but I will add some comments to make it clear:
   def inputEmpty: Boolean = numKeys == 0 && !anyNullKeyExists
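
   A minimal sketch of the three cases above, using a simplified stand-in rather than Spark's actual HashedRelation. `BuildSummary` and `RelationSketch.summarize` are hypothetical names introduced only for illustration, and a row is modeled as an `Option[Long]` single-column key (`None` meaning a null key). Only `numKeys`, `anyNullKeyExists`, and `inputEmpty` come from the discussion.

```scala
// Simplified model (an assumption, not Spark's HashedRelation):
// it records how many distinct non-null keys were kept and whether any null key was seen.
final case class BuildSummary(numKeys: Int, anyNullKeyExists: Boolean) {
  // Same definition as proposed in the comment above.
  def inputEmpty: Boolean = numKeys == 0 && !anyNullKeyExists
}

object RelationSketch {
  // Hypothetical builder: rows with a null key (None) are skipped,
  // but the fact that one was seen is remembered in the flag.
  def summarize(rows: Iterator[Option[Long]]): BuildSummary = {
    var keys = Set.empty[Long]
    var sawNull = false
    rows.foreach {
      case Some(k) => keys += k      // case 3: normal row is kept
      case None    => sawNull = true // case 2: null-key row is filtered out
    }
    BuildSummary(keys.size, sawNull)
  }
}

object Demo {
  // Case 1: empty input.
  val emptyInput = RelationSketch.summarize(Iterator.empty)
  // Case 2 only: the relation holds no keys, yet the input was not empty.
  val onlyNullKeys = RelationSketch.summarize(Iterator(None))
  // Cases 2 and 3 together.
  val mixed = RelationSketch.summarize(Iterator(Some(1L), None))
}
```

   In this sketch `Demo.onlyNullKeys` has `numKeys == 0` but `inputEmpty == false`, which is why `numKeys == 0` alone cannot stand in for an empty input.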




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



