xumingming commented on PR #56499:
URL: https://github.com/apache/spark/pull/56499#issuecomment-4747380711
@cloud-fan @uros-b Thanks for the review. I made the following changes:
- Added `isBinaryStable(a.dataType)` guard so we only do the substitution
for binary stable strings.
- Updated the test to check that the Join is indeed there.
- Removed the second pass and reused the `replaceConstraints` function for it instead.
However, two issues came up. Do you have suggestions on how to proceed
with them?
**The first one** is that one of the tests in
`InferFiltersFromConstraintsSuite.scala` now generates a predicate like the
following:
```scala
testRelation2.where(IsNotNull($"b") && ($"b" === 1L) &&
($"b".attr.cast(IntegerType) === 1)).subquery("right")
```
The filter now contains two near-duplicate predicates for `b == 1`: `$"b" === 1L` and the cast variant.
**The second one** is the TPCDS q78 plan change (which caused the CI to fail).
Before:
```text
Scan parquet ... store_sales ...
SubqueryBroadcast [d_date_sk] #1 <-- producer
Scan parquet ... web_sales ...
ReusedSubquery [d_date_sk] #1 <-- consumer of #1
Scan parquet ... catalog_sales ...
ReusedSubquery [d_date_sk] #1 <-- consumer of #1
```
After:
```text
Scan parquet ... store_sales ...
SubqueryBroadcast [d_date_sk] #1 <-- producer, unchanged
Scan parquet ... web_sales ...
SubqueryBroadcast [d_date_sk] #2 <-- NEW separate producer
Scan parquet ... catalog_sales ...
ReusedSubquery [d_date_sk] #2 <-- consumer of #2
```
Q78 plan-stability tests fail because the execution plan now builds two
date_dim dynamic-pruning subqueries instead of one. store_sales keeps the
original `EqualTo(d_year, 2000)` subquery, but web_sales now creates a
duplicate with `EqualNullSafe(d_year, 2000)`. Subquery-reuse detection treats
the two as different, so catalog_sales reuses the new web_sales subquery and the
original broadcast now serves only store_sales.
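
To make the reuse failure concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Pred`, the two case classes, and `assignSubqueryIds` are hypothetical stand-ins, not Catalyst classes. The sketch only assumes that reuse is keyed on structural equality of the subquery's predicate, and that catalog_sales carries the same predicate as web_sales, as the plan above suggests.

```scala
// Toy model, not Spark's real classes: each scan's pruning subquery is
// identified only by the predicate it applies to date_dim.
sealed trait Pred
final case class EqualTo(attr: String, lit: Int) extends Pred
final case class EqualNullSafe(attr: String, lit: Int) extends Pred

object ReuseSketch {
  // Hypothetical helper: hand out one id per structurally distinct predicate,
  // in first-seen order. Equal predicates share an id (that is "reuse").
  def assignSubqueryIds(scans: Seq[(String, Pred)]): Seq[(String, Int)] = {
    val ids = scala.collection.mutable.LinkedHashMap.empty[Pred, Int]
    scans.map { case (table, pred) =>
      table -> ids.getOrElseUpdate(pred, ids.size + 1)
    }
  }

  // Before the change: all three scans prune with the same EqualTo predicate.
  val before: Seq[(String, Int)] = assignSubqueryIds(Seq(
    "store_sales"   -> EqualTo("d_year", 2000),
    "web_sales"     -> EqualTo("d_year", 2000),
    "catalog_sales" -> EqualTo("d_year", 2000)
  ))

  // After the change: web_sales (and, by assumption, catalog_sales) carry the
  // EqualNullSafe variant, which is not structurally equal to EqualTo.
  val after: Seq[(String, Int)] = assignSubqueryIds(Seq(
    "store_sales"   -> EqualTo("d_year", 2000),
    "web_sales"     -> EqualNullSafe("d_year", 2000),
    "catalog_sales" -> EqualNullSafe("d_year", 2000)
  ))

  def main(args: Array[String]): Unit = {
    println(before) // ids 1, 1, 1: one producer, two consumers
    println(after)  // ids 1, 2, 2: a second producer appears
  }
}
```

In this toy model the ids mirror the `#1` / `#2` labels in the plans above; the real reuse logic compares canonicalized plans, which this sketch does not attempt to reproduce.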
**Where is the `EqualNullSafe(d_year, 2000)` from?** For the outer query’s
`ss LEFT JOIN ws ON ws_sold_year = ss_sold_year ...`, the optimizer generates
an `EqualNullSafe(ws_sold_year, ss_sold_year)` constraint (needed for LEFT JOIN
null semantics).
The new literal-substitution inference in
`QueryPlanConstraints.inferAdditionalConstraints` substitutes `ss_sold_year →
2000` into that `EqualNullSafe`, producing:
```scala
EqualNullSafe(ws_sold_year, 2000)
```
That inferred predicate is then pushed down through the `ws` CTE and into
its `date_dim` dynamic-pruning subquery, where it becomes
`EqualNullSafe(d_year, 2000)`.
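
One observation that may be relevant here: when used as a filter against a non-null literal, `<=>` and `=` keep the same rows, even though the two expressions are structurally different. The sketch below models SQL three-valued logic with `Option` in plain Scala; `NullSemanticsSketch` and its members are hypothetical names for illustration, not Spark code.

```scala
object NullSemanticsSketch {
  // SQL `=`: the result is unknown (None) if either side is NULL.
  def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never unknown; NULL <=> NULL is true, NULL <=> value is false.
  def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean = a == b

  // Sample d_year values, including a NULL.
  val dYear: Seq[Option[Int]] = Seq(Some(1999), Some(2000), None)

  // A filter keeps a row only when the predicate is definitely true, so an
  // unknown result from `=` drops the row just like a false from `<=>`.
  val keptByEqualTo: Seq[Option[Int]] =
    dYear.filter(v => equalTo(v, Some(2000)).contains(true))
  val keptByNullSafe: Seq[Option[Int]] =
    dYear.filter(v => equalNullSafe(v, Some(2000)))

  def main(args: Array[String]): Unit = {
    println(keptByEqualTo)
    println(keptByNullSafe)
  }
}
```

Both filters keep only the `2000` row in this model. Whether the optimizer should exploit that equivalence is a separate question that this sketch does not answer.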
The original code never substituted literals into `EqualNullSafe`, so it
never created this predicate. It only inferred `EqualTo(ws_sold_year, 2000)`
from the `EqualTo(ws_sold_year, ss_sold_year)` constraint.
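
The difference between the old and new behavior can be sketched with a toy expression model. This is plain Scala with hypothetical names (`Expr`, `substitute`, `infer`, `intoNullSafe`), not the actual `inferAdditionalConstraints` code; it only illustrates how allowing substitution into `EqualNullSafe` yields the extra predicate.

```scala
// Toy expression model (hypothetical, not Catalyst's API).
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class EqualTo(left: Expr, right: Expr) extends Expr
final case class EqualNullSafe(left: Expr, right: Expr) extends Expr

object SubstitutionSketch {
  // Replace every occurrence of attribute `from` with literal `to`.
  def substitute(e: Expr, from: Attr, to: Lit): Expr = e match {
    case a: Attr if a == from => to
    case EqualTo(l, r) =>
      EqualTo(substitute(l, from, to), substitute(r, from, to))
    case EqualNullSafe(l, r) =>
      EqualNullSafe(substitute(l, from, to), substitute(r, from, to))
    case other => other
  }

  // For each `attr = literal` constraint, rewrite the other constraints.
  // `intoNullSafe` toggles whether EqualNullSafe constraints are rewritten.
  def infer(constraints: Set[Expr], intoNullSafe: Boolean): Set[Expr] = {
    val bindings = constraints.collect { case EqualTo(a: Attr, l: Lit) => a -> l }
    bindings.flatMap { case (attr, lit) =>
      constraints
        .filter(_ != EqualTo(attr, lit)) // skip the binding itself
        .filter(c => intoNullSafe || !c.isInstanceOf[EqualNullSafe])
        .map(substitute(_, attr, lit))
        .filterNot(constraints.contains) // keep only genuinely new predicates
    }
  }

  val ss = Attr("ss_sold_year")
  val ws = Attr("ws_sold_year")
  val constraints: Set[Expr] = Set(
    EqualTo(ss, Lit(2000)),
    EqualTo(ws, ss),
    EqualNullSafe(ws, ss)
  )

  // Old behavior in this model: only EqualTo(ws_sold_year, 2000) is inferred.
  val oldInferred: Set[Expr] = infer(constraints, intoNullSafe = false)
  // New behavior in this model: EqualNullSafe(ws_sold_year, 2000) is added too.
  val newInferred: Set[Expr] = infer(constraints, intoNullSafe = true)

  def main(args: Array[String]): Unit = {
    println(oldInferred)
    println(newInferred)
  }
}
```

With the toggle off, the model infers a single predicate; with it on, the additional `EqualNullSafe` form appears alongside it, which is the predicate that later reaches the `date_dim` subquery.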