cloud-fan commented on code in PR #53695:
URL: https://github.com/apache/spark/pull/53695#discussion_r3303343892
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala:
##########
@@ -45,8 +47,11 @@ import org.apache.spark.util.ArrayImplicits._
* binary `UnsafeRow` and compare the binary data directly. Different NaNs
have different binary
* representation, and the same thing happens for -0.0 and 0.0.
*
- * This rule normalizes NaN and -0.0 in window partition keys, join keys and
aggregate grouping
- * keys.
+ * Case 5 is problematic for a similar reason: hash-based array set operations
compare elements by
+ * their binary representation via hash sets.
+ *
+ * This rule normalizes NaN and -0.0 in window partition keys, join keys,
aggregate grouping
+ * keys, and the inputs of array set operations.
Review Comment:
The class docstring is updated for case 5, but the "Note that…" paragraph
just below (lines 61-62) still reads as if the late batch is the only
invocation:
> Note that, this rule must be executed at the end of optimizer, because the
optimizer may create new joins(the subquery rewrite) and new join
conditions(the join reorder).
With this PR, the rule also runs early inside `FinishAnalysis` (between
`ReplaceExpressions` and `EvalInlineTables`) so that array set ops are wrapped
before `ConstantFolding` / `ConvertToLocalRelation` / `EvalInlineTables` can
pre-evaluate them. The late batch only catches joins/conditions newly produced
after `FinishAnalysis`. Worth updating the note so a future maintainer doesn't
conclude the late batch alone is the contract — e.g.:
```suggestion
* This rule runs in two places. It runs once inside `FinishAnalysis` (right
after
* `ReplaceExpressions` and before `EvalInlineTables`), so that array set
operations are
* wrapped before optimizer rules that pre-evaluate expressions (e.g.
`ConstantFolding`,
* `ConvertToLocalRelation`, `EvalInlineTables`) would otherwise miss them.
It also runs as
* a late batch at the end of the optimizer, because rules like subquery
rewrite and join
* reorder can create new joins/join conditions after `FinishAnalysis` that
still need
* their keys normalized.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]