cloud-fan commented on code in PR #56499:
URL: https://github.com/apache/spark/pull/56499#discussion_r3449190104


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala:
##########
@@ -71,12 +72,17 @@ trait ConstraintHelper {
         val candidateConstraints = predicates - eq - EqualNullSafe(l, r)
         inferredConstraints ++= replaceConstraints(candidateConstraints, l, r)
         inferredConstraints ++= replaceConstraints(candidateConstraints, r, l)
+      case eq @ EqualTo(l: Attribute, r: Literal) if 
isBinaryStable(l.dataType) =>

Review Comment:
   The two new `EqualTo(attr, Literal)` cases fold the bound literal into 
*every* remaining predicate via `replaceConstraints(predicates - eq, l, r)`. 
That generality — not the optimization itself — is the common source of the 
three problems here: collation (now patched with `isBinaryStable`), the 
`EqualNullSafe(attr, literal)` that breaks subquery-reuse and regresses q78, 
and the redundant `cast(b)=1` duplicate (test nit).
   
   The root issue is that two different inferences are being squeezed into the 
same equality pattern-match. Equality *transitivity* (the existing 
`EqualTo(attr, attr)` case) and folding a constant into a *range* predicate are 
distinct concerns and want distinct code. I'd pull the literal-into-inequality 
work into its own pass rather than adding two cases here — the same way 
`constructIsNotNullConstraints` is a separate method invoked alongside 
`inferAdditionalConstraints`:
   
   1. Drop the two `EqualTo(l: Attribute, r: Literal)` / `EqualTo(l: Literal, 
r: Attribute)` cases from the match.
   2. Add a dedicated `inferRangeConstraints(predicates)` (invoked after the 
equality loop and unioned into `inferredConstraints` like the others), which:
      - collects the `Attribute -> Literal` bindings into a map first, keeping 
the `isBinaryStable` guard per bound attribute;
      - walks only the **inequality** predicates (`<`, `<=`, `>`, `>=`), and 
where one side is a pinned attribute — directly or under a numeric 
`Add`/`Subtract` offset — substitutes the literal and emits the folded bound 
(`b.v >= a.k` with `a.k = 5` => `b.v >= 5`; `b.v <= a.k + 10` => `b.v <= 15`);
      - keeps a derived bound only when it is actually new/tighter than an 
existing bound on that attribute (so re-runs don't churn).
   
   Because that pass never rewrites `EqualTo`/`EqualNullSafe`, the q78 
duplicate-subquery regression and the `cast(b)=1` duplicates cannot arise by 
construction; the collation guard attaches naturally to the bounded attribute; 
and equality propagation stays where it belongs, in the existing transitivity 
case. As a bonus this keeps the `Once`-batch idempotence reasoning local to one 
place — a range bound is only ever emitted when strictly tighter, so the pass 
is self-idempotent without needing the `EqualNullSafe` carve-out the equality 
case relies on.
   
   If you'd rather keep the current shape for this round, the **minimum** to 
unblock q78 is to exclude `EqualNullSafe` from the targets in both cases:
   
   ```scala
   case eq @ EqualTo(l: Attribute, r: Literal) if isBinaryStable(l.dataType) =>
     inferredConstraints ++=
       replaceConstraints((predicates - 
eq).filterNot(_.isInstanceOf[EqualNullSafe]), l, r)
   ```
   
   That restores the original q78 plan with no golden-file churn (the old code 
inferred nothing from the LEFT JOIN's `EqualNullSafe`, since the attr-attr case 
only matched `EqualTo`), but leaves the duplicate-equality churn in place. Note 
the sibling `EqualTo(attr, attr)` case at lines 70-72 already excludes 
`EqualNullSafe` for the same idempotence reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to