cloud-fan commented on code in PR #56499:
URL: https://github.com/apache/spark/pull/56499#discussion_r3449190104
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala:
##########
@@ -71,12 +72,17 @@ trait ConstraintHelper {
val candidateConstraints = predicates - eq - EqualNullSafe(l, r)
inferredConstraints ++= replaceConstraints(candidateConstraints, l, r)
inferredConstraints ++= replaceConstraints(candidateConstraints, r, l)
+ case eq @ EqualTo(l: Attribute, r: Literal) if
isBinaryStable(l.dataType) =>
Review Comment:
The two new `EqualTo(attr, Literal)` cases fold the bound literal into
*every* remaining predicate via `replaceConstraints(predicates - eq, l, r)`.
That generality — not the optimization itself — is the common source of the
three problems here: collation (now patched with `isBinaryStable`), the
`EqualNullSafe(attr, literal)` that breaks subquery-reuse and regresses q78,
and the redundant `cast(b)=1` duplicate (test nit).
The root issue is that two different inferences are being squeezed into the
same equality pattern-match. Equality *transitivity* (the existing
`EqualTo(attr, attr)` case) and folding a constant into a *range* predicate are
distinct concerns and want distinct code. I'd pull the literal-into-inequality
work into its own pass rather than adding two cases here — the same way
`constructIsNotNullConstraints` is a separate method invoked alongside
`inferAdditionalConstraints`:
1. Drop the two `EqualTo(l: Attribute, r: Literal)` / `EqualTo(l: Literal,
r: Attribute)` cases from the match.
2. Add a dedicated `inferRangeConstraints(predicates)` (invoked after the
equality loop and unioned into `inferredConstraints` like the others), which:
- collects the `Attribute -> Literal` bindings into a map first, keeping
the `isBinaryStable` guard per bound attribute;
- walks only the **inequality** predicates (`<`, `<=`, `>`, `>=`), and
where one side is a pinned attribute — directly or under a numeric
`Add`/`Subtract` offset — substitutes the literal and emits the folded bound
(`b.v >= a.k` with `a.k = 5` => `b.v >= 5`; `b.v <= a.k + 10` => `b.v <= 15`);
- keeps a derived bound only when it is actually new/tighter than an
existing bound on that attribute (so re-runs don't churn).
Because that pass never rewrites `EqualTo`/`EqualNullSafe`, the q78
duplicate-subquery regression and the `cast(b)=1` duplicates cannot arise by
construction; the collation guard attaches naturally to the bounded attribute;
and equality propagation stays where it belongs, in the existing transitivity
case. As a bonus this keeps the `Once`-batch idempotence reasoning local to one
place — a range bound is only ever emitted when strictly tighter, so the pass
is self-idempotent without needing the `EqualNullSafe` carve-out the equality
case relies on.
If you'd rather keep the current shape for this round, the **minimum** to
unblock q78 is to exclude `EqualNullSafe` from the targets in both cases:
```scala
case eq @ EqualTo(l: Attribute, r: Literal) if isBinaryStable(l.dataType) =>
inferredConstraints ++=
replaceConstraints((predicates -
eq).filterNot(_.isInstanceOf[EqualNullSafe]), l, r)
```
That restores the original q78 plan with no golden-file churn (the old code
inferred nothing from the LEFT JOIN's `EqualNullSafe`, since the attr-attr case
only matched `EqualTo`), but leaves the duplicate-equality churn in place. Note
the sibling `EqualTo(attr, attr)` case at lines 70-72 already excludes
`EqualNullSafe` for the same idempotence reason.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]