sunchao commented on code in PR #55927:
URL: https://github.com/apache/spark/pull/55927#discussion_r3267941587
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala:
##########
@@ -324,6 +329,46 @@ case class HashPartitioning(expressions: Seq[Expression],
numPartitions: Int)
newChildren: IndexedSeq[Expression]): HashPartitioning = copy(expressions
= newChildren)
}
+/**
+ * Represents a hash partitioning for equi-join inputs where rows with a NULL
join key do not need
+ * to be co-located. Non-NULL join keys preserve the same partitioning
contract as
+ * [[HashPartitioning]], while rows with any NULL join key may be spread
across partitions.
+ */
+case class NullAwareHashPartitioning(expressions: Seq[Expression],
numPartitions: Int)
Review Comment:
Yea this is an alternative design. The pros and cons are:
Pros:
* Much less duplicated code.
* Existing `HashPartitioning` plumbing can often be reused directly.
* `CoalescedHashPartitioning` and `HashShuffleSpec` can carry the flag
instead of requiring parallel classes.
* Fewer pattern-match branches across the codebase.
Cons:
* The semantic distinction becomes easier to overlook.
* A `HashPartitioning` with `spreadNullKeys = true` is no longer “ordinary
hash partitioning” in the old sense.
* Every place that reasons about `HashPartitioning` now has to remember to
inspect the flag before assuming strict same-key co-location.
* That is subtle and potentially error-prone because `HashPartitioning` is
already widely used.
* The class name no longer advertises the weaker contract; you would need
careful `toString`, docs, and audits to preserve the same clarity.
I'm a bit concerned about the cons since `HashPartitioning` is widely used
in the codebase and the change could have a bigger blast radius than just
adding another `NullAwareHashPartitioning`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]