nchammas commented on code in PR #56146:
URL: https://github.com/apache/spark/pull/56146#discussion_r3447171800


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala:
##########
@@ -28,36 +28,30 @@ import org.apache.spark.sql.types._
 import org.apache.spark.util.ArrayImplicits._
 
 /**
- * We need to take care of special floating numbers (NaN and -0.0) in several 
places:
- *   1. When compare values, different NaNs should be treated as same, `-0.0` 
and `0.0` should be
- *      treated as same.
- *   2. In aggregate grouping keys, different NaNs should belong to the same 
group, `-0.0` and `0.0`
- *      should belong to the same group.
- *   3. In join keys, different NaNs should be treated as same, `-0.0` and 
`0.0` should be
- *      treated as same.
- *   4. In window partition keys, different NaNs should belong to the same 
partition, `-0.0`
- *      and `0.0` should belong to the same partition.
- *   5. In hash-based array set operations, different NaNs should be treated 
as same, `-0.0`
- *      and `0.0` should be treated as same.
+ * Certain pairs of floating point numbers require special handling:
+ *   1. 0.0 / -0.0
+ *   2. NaN / NaN
+ * That's because we want to treat each of these pairs of numbers as equal, 
even though they
+ * have different binary representations. (IEEE 754 allows multiple distinct 
bit patterns for
+ * NaN.)
  *
- * Case 1 is fine, as we handle NaN and `-0.0` well during comparison. For 
complex types, we
- * recursively compare the fields/elements, so it's also fine.
+ * This special handling is required in several places:
+ *   1. When comparing values
+ *   2. When grouping keys for aggregates
+ *   3. When joining keys
+ *   4. When partitioning keys for windows
+ *   5. When executing array set operations
  *
- * Case 2, 3 and 4 are problematic, as Spark SQL turns grouping/join/window 
partition keys into
- * binary `UnsafeRow` and compare the binary data directly. Different NaNs 
have different binary
- * representation, and the same thing happens for `-0.0` and `0.0`.
- *
- * Case 5 is problematic for a similar reason: hash-based array set operations 
compare elements by
- * their binary representation via hash sets.
+ * Case 1 is already handled in [[SQLOrderingUtil]] and 
[[CodeGenerator.genEqual]].
+ * Cases 2-5 are handled by this optimizer rule.

Review Comment:
   OK, I added some detail back about the comparison methods we use across the 
code base that need careful consideration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to