Re: [PR] [SPARK-54918][SQL] Normalize floating numbers in array set operations [spark]

via GitHub Tue, 26 May 2026 04:18:29 -0700


cloud-fan commented on code in PR #53695:
URL: https://github.com/apache/spark/pull/53695#discussion_r3303343892



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala:
##########
@@ -45,8 +47,11 @@ import org.apache.spark.util.ArrayImplicits._
  * binary `UnsafeRow` and compare the binary data directly. Different NaNs 
have different binary
  * representation, and the same thing happens for -0.0 and 0.0.
  *
- * This rule normalizes NaN and -0.0 in window partition keys, join keys and 
aggregate grouping
- * keys.
+ * Case 5 is problematic for a similar reason: hash-based array set operations 
compare elements by
+ * their binary representation via hash sets.
+ *
+ * This rule normalizes NaN and -0.0 in window partition keys, join keys, 
aggregate grouping
+ * keys, and the inputs of array set operations.

Review Comment:
   The class docstring is updated for case 5, but the "Note that…" paragraph 
just below (lines 61-62) still reads as if the late batch is the only 
invocation:
   
   > Note that, this rule must be executed at the end of optimizer, because the 
optimizer may create new joins(the subquery rewrite) and new join 
conditions(the join reorder).
   
   With this PR, the rule also runs early inside `FinishAnalysis` (between 
`ReplaceExpressions` and `EvalInlineTables`) so that array set ops are wrapped 
before `ConstantFolding` / `ConvertToLocalRelation` / `EvalInlineTables` can 
pre-evaluate them. The late batch only catches joins/conditions newly produced 
after `FinishAnalysis`. Worth updating the note so a future maintainer doesn't 
conclude the late batch alone is the contract — e.g.:
   
   ```suggestion
    * This rule runs in two places. It runs once inside `FinishAnalysis` (right 
after
    * `ReplaceExpressions` and before `EvalInlineTables`), so that array set 
operations are
    * wrapped before optimizer rules that pre-evaluate expressions (e.g. 
`ConstantFolding`,
    * `ConvertToLocalRelation`, `EvalInlineTables`) would otherwise miss them. 
It also runs as
    * a late batch at the end of the optimizer, because rules like subquery 
rewrite and join
    * reorder can create new joins/join conditions after `FinishAnalysis` that 
still need
    * their keys normalized.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-54918][SQL] Normalize floating numbers in array set operations [spark]

Reply via email to