Re: [PR] [SPARK-56903][SQL] Spread NULL outer join keys across shuffle partitions [spark]

via GitHub Tue, 19 May 2026 09:34:26 -0700


sunchao commented on code in PR #55927:
URL: https://github.com/apache/spark/pull/55927#discussion_r3267941587



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##########
@@ -324,6 +329,46 @@ case class HashPartitioning(expressions: Seq[Expression], 
numPartitions: Int)
     newChildren: IndexedSeq[Expression]): HashPartitioning = copy(expressions 
= newChildren)
 }
 
+/**
+ * Represents a hash partitioning for equi-join inputs where rows with a NULL 
join key do not need
+ * to be co-located. Non-NULL join keys preserve the same partitioning 
contract as
+ * [[HashPartitioning]], while rows with any NULL join key may be spread 
across partitions.
+ */
+case class NullAwareHashPartitioning(expressions: Seq[Expression], 
numPartitions: Int)

Review Comment:
   Yea this is an alternative design. The pros and cons are:
   
   Pros:
   * Much less duplicated code.
   * Existing `HashPartitioning` plumbing can often be reused directly.
   * `CoalescedHashPartitioning` and `HashShuffleSpec` can carry the flag 
instead of requiring parallel classes.
   * Fewer pattern-match branches across the codebase.
   
   Cons:
   * The semantic distinction becomes easier to overlook.
   * A `HashPartitioning` with `spreadNullKeys = true` is no longer “ordinary 
hash partitioning” in the old sense.
   * Every place that reasons about `HashPartitioning` now has to remember to 
inspect the flag before assuming strict same-key co-location.
   * That is subtle and potentially error-prone because `HashPartitioning` is 
already widely used.
   * The class name no longer advertises the weaker contract; you would need 
careful `toString`, docs, and audits to preserve the same clarity.
   
   I'm a bit concerned about the cons since `HashPartitioning` is widely used 
in the codebase and the change could have a bigger blast radius than just 
adding another `NullAwareHashPartitioning`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56903][SQL] Spread NULL outer join keys across shuffle partitions [spark]

Reply via email to