cloud-fan commented on code in PR #56146: URL: https://github.com/apache/spark/pull/56146#discussion_r3424459461
########## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala: ########## @@ -28,36 +28,30 @@ import org.apache.spark.sql.types._ import org.apache.spark.util.ArrayImplicits._ /** - * We need to take care of special floating numbers (NaN and -0.0) in several places: - * 1. When compare values, different NaNs should be treated as same, `-0.0` and `0.0` should be - * treated as same. - * 2. In aggregate grouping keys, different NaNs should belong to the same group, `-0.0` and `0.0` - * should belong to the same group. - * 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be - * treated as same. - * 4. In window partition keys, different NaNs should belong to the same partition, `-0.0` - * and `0.0` should belong to the same partition. - * 5. In hash-based array set operations, different NaNs should be treated as same, `-0.0` - * and `0.0` should be treated as same. + * Certain pairs of floating point numbers require special handling: + * 1. 0.0 / -0.0 + * 2. NaN / NaN + * That's because we want to treat each of these pairs of numbers as equal, even though they + * have different binary representations. (IEEE 754 allows multiple distinct bit patterns for + * NaN.) * - * Case 1 is fine, as we handle NaN and `-0.0` well during comparison. For complex types, we - * recursively compare the fields/elements, so it's also fine. + * This special handling is required in several places: + * 1. When comparing values + * 2. When grouping keys for aggregates + * 3. When joining keys + * 4. When partitioning keys for windows + * 5. When executing array set operations * - * Case 2, 3 and 4 are problematic, as Spark SQL turns grouping/join/window partition keys into - * binary `UnsafeRow` and compare the binary data directly. Different NaNs have different binary - * representation, and the same thing happens for `-0.0` and `0.0`. - * - * Case 5 is problematic for a similar reason: hash-based array set operations compare elements by - * their binary representation via hash sets. + * Case 1 is already handled in [[SQLOrderingUtil]] and [[CodeGenerator.genEqual]]. + * Cases 2-5 are handled by this optimizer rule. Review Comment: Agreed that the high-level "why" — equal despite different binary representations — is best stated once, and L34-36 does that well. I don't want the per-case repetition back. But I think that's a different "why" than the one I was after. L34-36 explains why these pairs *are* equal; it doesn't explain why we need a logical-plan rule to normalize them. Case 1 already shows equality alone is handled without any rule (in `SQLOrderingUtil`/`genEqual`). The reason cases 3-5 need normalization is that Spark compares grouping/join/window keys as raw `UnsafeRow` bytes, and array set ops hash the binary representation — both bypass that case-1 equality logic. That's the insight a future reader needs to understand why this lives here rather than in `EqualTo`'s eval. The `UnsafeRow` half is still implied by the "Ideally we should do the normalization in the physical operators that compare the binary `UnsafeRow` directly" paragraph below, but the hash-set rationale for the array set ops is gone entirely. Could you add a single sentence right after "This special handling is required in several places:" — something like: "Spark compares these keys by their binary `UnsafeRow` representation, and array set ops compare elements via hash sets over that representation; both bypass the equality logic from case 1." That closes the gap without reintroducing the per-case repetition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
