[GitHub] [spark] leanken commented on a change in pull request #29104: [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize

GitBox Mon, 20 Jul 2020 23:50:44 -0700


leanken commented on a change in pull request #29104:
URL: https://github.com/apache/spark/pull/29104#discussion_r457870914




##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -171,6 +171,23 @@
   private volatile MapIterator destructiveIterator = null;
   private LinkedList<UnsafeSorterSpillWriter> spillWriters = new LinkedList<>();
 
+  private boolean anyNullKeyExists = false;
+
+  public boolean inputEmpty()
+  {
+    return ((numKeys == 0) && !anyNullKeyExists);
+  }
+
+  public boolean isAnyNullKeyExists()
+  {
+    return anyNullKeyExists;
+  }
+
+  public void setAnyNullKeyExists(boolean anyNullKeyExists)
+  {
+    this.anyNullKeyExists = anyNullKeyExists;

Review comment:
   Yes, no extra scan is needed.
   `anyNullKeyExists` is set while iterating over the input. If the input is
   empty or no row has a null key, it keeps its default value, `false`.

   ```
   while (input.hasNext) {
     val row = input.next().asInstanceOf[UnsafeRow]
     numFields = row.numFields()
     val key = keyGenerator(row)
     if (!key.anyNull) {
       // Non-null key: append the row to the map as before.
       val loc = binaryMap.lookup(key.getBaseObject, key.getBaseOffset, key.getSizeInBytes)
       val success = loc.append(
         key.getBaseObject, key.getBaseOffset, key.getSizeInBytes,
         row.getBaseObject, row.getBaseOffset, row.getSizeInBytes)
       if (!success) {
         binaryMap.free()
         // scalastyle:off throwerror
         throw new SparkOutOfMemoryError("There is not enough memory to build hash map")
         // scalastyle:on throwerror
       }
     } else {
       // HERE: a null key was seen, recorded in this same pass.
       binaryMap.setAnyNullKeyExists(true)
     }
   }
   ```
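
The single-pass idea can also be shown outside Spark. The following is a minimal, self-contained sketch, not the PR's implementation: `NullAwareBuildSketch`, `build`, and `numKeys()` are hypothetical names, and a plain `HashMap` stands in for `BytesToBytesMap`. It assumes `inputEmpty()` means "no keys stored and no null key seen", as in the diff above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the map in the diff; not Spark's BytesToBytesMap.
class NullAwareBuildSketch {
    private final Map<Integer, List<String>> rowsByKey = new HashMap<>();
    // Stays false unless a null key is seen while building.
    private boolean anyNullKeyExists = false;

    // Builds the map in one pass: rows with a non-null key are stored,
    // rows with a null key only flip the flag and are not stored.
    static NullAwareBuildSketch build(Iterator<Map.Entry<Integer, String>> input) {
        NullAwareBuildSketch map = new NullAwareBuildSketch();
        while (input.hasNext()) {
            Map.Entry<Integer, String> row = input.next();
            if (row.getKey() != null) {
                map.rowsByKey
                    .computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                    .add(row.getValue());
            } else {
                map.anyNullKeyExists = true;
            }
        }
        return map;
    }

    // True only if the input had no rows at all (no stored keys, no null keys).
    boolean inputEmpty() {
        return rowsByKey.isEmpty() && !anyNullKeyExists;
    }

    boolean isAnyNullKeyExists() {
        return anyNullKeyExists;
    }

    int numKeys() {
        return rowsByKey.size();
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>(1, "a"));
        rows.add(new SimpleEntry<>(null, "b"));
        NullAwareBuildSketch map = build(rows.iterator());
        System.out.println("inputEmpty=" + map.inputEmpty()
            + " anyNullKeyExists=" + map.isAnyNullKeyExists());
    }
}
```

Because the flag is written inside the build loop, an input that contains only null-key rows ends up with zero stored keys but is still distinguishable from a truly empty input.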




