cloud-fan commented on code in PR #56146:
URL: https://github.com/apache/spark/pull/56146#discussion_r3424459461


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala:
##########
@@ -28,36 +28,30 @@ import org.apache.spark.sql.types._
 import org.apache.spark.util.ArrayImplicits._
 
 /**
- * We need to take care of special floating numbers (NaN and -0.0) in several 
places:
- *   1. When compare values, different NaNs should be treated as same, `-0.0` 
and `0.0` should be
- *      treated as same.
- *   2. In aggregate grouping keys, different NaNs should belong to the same 
group, `-0.0` and `0.0`
- *      should belong to the same group.
- *   3. In join keys, different NaNs should be treated as same, `-0.0` and 
`0.0` should be
- *      treated as same.
- *   4. In window partition keys, different NaNs should belong to the same 
partition, `-0.0`
- *      and `0.0` should belong to the same partition.
- *   5. In hash-based array set operations, different NaNs should be treated 
as same, `-0.0`
- *      and `0.0` should be treated as same.
+ * Certain pairs of floating point numbers require special handling:
+ *   1. 0.0 / -0.0
+ *   2. NaN / NaN
+ * That's because we want to treat each of these pairs of numbers as equal, 
even though they
+ * have different binary representations. (IEEE 754 allows multiple distinct 
bit patterns for
+ * NaN.)
  *
- * Case 1 is fine, as we handle NaN and `-0.0` well during comparison. For 
complex types, we
- * recursively compare the fields/elements, so it's also fine.
+ * This special handling is required in several places:
+ *   1. When comparing values
+ *   2. When grouping keys for aggregates
+ *   3. When joining keys
+ *   4. When partitioning keys for windows
+ *   5. When executing array set operations
  *
- * Case 2, 3 and 4 are problematic, as Spark SQL turns grouping/join/window 
partition keys into
- * binary `UnsafeRow` and compare the binary data directly. Different NaNs 
have different binary
- * representation, and the same thing happens for `-0.0` and `0.0`.
- *
- * Case 5 is problematic for a similar reason: hash-based array set operations 
compare elements by
- * their binary representation via hash sets.
+ * Case 1 is already handled in [[SQLOrderingUtil]] and 
[[CodeGenerator.genEqual]].
+ * Cases 2-5 are handled by this optimizer rule.

Review Comment:
   Agreed that the high-level "why" — equal despite different binary 
representations — is best stated once, and L34-36 does that well. I don't want 
the per-case repetition back.
   
   But I think that's a different "why" than the one I was after. L34-36 
explains why these pairs *are* equal; it doesn't explain why we need a 
logical-plan rule to normalize them. Case 1 already shows equality alone is 
handled without any rule (in `SQLOrderingUtil`/`genEqual`). The reason cases 
3-5 need normalization is that Spark compares grouping/join/window keys as raw 
`UnsafeRow` bytes, and array set ops hash the binary representation — both 
bypass that case-1 equality logic. That's the insight a future reader needs to 
understand why this lives here rather than in `EqualTo`'s eval.
   
   The `UnsafeRow` half is still implied by the "Ideally we should do the 
normalization in the physical operators that compare the binary `UnsafeRow` 
directly" paragraph below, but the hash-set rationale for the array set ops is 
gone entirely.
   
   Could you add a single sentence right after "This special handling is 
required in several places:" — something like: "Spark compares these keys by 
their binary `UnsafeRow` representation, and array set ops compare elements via 
hash sets over that representation; both bypass the equality logic from case 
1." That closes the gap without reintroducing the per-case repetition.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to