maropu commented on a change in pull request #29304:
URL: https://github.com/apache/spark/pull/29304#discussion_r464068612



##########
File path: 
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
##########
@@ -591,6 +591,18 @@ public boolean anyNull() {
     return BitSetMethods.anySet(baseObject, baseOffset, bitSetWidthInBytes / 
8);
   }
 
+  /**
+   * return whether an UnsafeRow is null on every column.

Review comment:
       nit: `return` -> `Return` and please add tests in `UnsafeRowSuite` if 
you add a new method here.

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -391,35 +393,41 @@ object PhysicalWindow {
   }
 }
 
-object ExtractSingleColumnNullAwareAntiJoin extends JoinSelectionHelper with 
PredicateHelper {
-
-  // TODO support multi column NULL-aware anti join in future.
-  // See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6
-  // multi-column null aware anti join is much more complicated than single 
column ones.
+object ExtractNullAwareAntiJoinKeys extends JoinSelectionHelper with 
PredicateHelper {

Review comment:
       hm, I think its better to add some fine-grained tests for this 
extractor. WDHY?

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -245,7 +244,7 @@ case class BroadcastHashJoinExec(
            |boolean $found = false;
            |// generate join key for stream side
            |${keyEv.code}
-           |if ($anyNull) {
+           |if (${if (isLongHashedRelation) s"$anyNull" else 
s"${keyEv.value}.allNull()"}) {

Review comment:
       Does this work correctly?, e.g., the case where rewriting 4 short join 
keys `(short, short, short, short)` into a single long key?

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
##########
@@ -580,4 +580,100 @@ class HashedRelationSuite extends SharedSparkSession {
       assert(proj(packedKeys).get(0, dt) == -i - 1)
     }
   }
+
+  test("NullAwareHashedRelation") {

Review comment:
       Could you make the test title clearer about what is this test for?

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##########
@@ -327,11 +327,27 @@ private[joins] object UnsafeHashedRelation {
     // Create a mapping of buildKeys -> rows
     val keyGenerator = UnsafeProjection.create(key)
     var numFields = 0
+    val nullPaddingCombinations: Seq[UnsafeProjection] = if (isNullAware) {
+      // C(numKeys, 0), C(numKeys, 1) ... C(numKeys, numKeys - 1)
+      // In total 2^numKeys - 1 records will be appended.
+      key.indices.flatMap { n =>
+        key.indices.combinations(n).map { combination =>
+          // combination is Seq[Int] indicates which key should be replaced to 
null padding
+          val exprs = key.indices.map { index =>
+            if (combination.contains(index)) {
+              Literal.create(null, key(index).dataType)

Review comment:
       We need to expand it like this even if `key(index).nullable=false`?

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
##########
@@ -391,35 +393,41 @@ object PhysicalWindow {
   }
 }
 
-object ExtractSingleColumnNullAwareAntiJoin extends JoinSelectionHelper with 
PredicateHelper {
-
-  // TODO support multi column NULL-aware anti join in future.
-  // See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6

Review comment:
       How about keeping this comment for the reference to NAAJ?  I think this 
is a good material to understand how it works. (NOTE: IMHO the link to [the ACM 
page](https://dl.acm.org/doi/10.14778/1687553.1687563) is better than the 
direct link to the PDF).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to