xumingming commented on PR #56499:
URL: https://github.com/apache/spark/pull/56499#issuecomment-4747380711

   @cloud-fan @uros-b Thanks for the review. The following changes are made:
   
   - Added `isBinaryStable(a.dataType)` guard so we only do the substitution 
for binary stable strings.
   - Optimized the test to check the Join is indeed there.
   - Removed the second pass, reused the replaceConstraints function to do that.
   
   But there are some issues pop up, do you have suggestions how to proceed 
with these two issues?. 
   
   **The first one** is that one of the test in 
`InferFiltersFromConstraintsSuite.scala` now generates predicate like the 
following:
   
   ```
   testRelation2.where(IsNotNull($"b") && ($"b" === 1L) &&
             ($"b".attr.cast(IntegerType) === 1)).subquery("right")
   ```
   
   There are two similar predicates for the `b == 1`.
   
   
   **The second one is the TPCDS q78 plan changes(which caused the CI to fail)**
   
   Before:
   ```text
   Scan parquet ... store_sales ...
     SubqueryBroadcast [d_date_sk] #1   <-- producer
   Scan parquet ... web_sales ...
     ReusedSubquery [d_date_sk] #1      <-- consumer of #1
   Scan parquet ... catalog_sales ...
     ReusedSubquery [d_date_sk] #1      <-- consumer of #1
   ```
   
   After:
   ```text
   Scan parquet ... store_sales ...
     SubqueryBroadcast [d_date_sk] #1   <-- producer, unchanged
   Scan parquet ... web_sales ...
     SubqueryBroadcast [d_date_sk] #2   <-- NEW separate producer
   Scan parquet ... catalog_sales ...
     ReusedSubquery [d_date_sk] #2      <-- consumer of #2
   ```
   
   Q78 plan-stability tests fail because the execution plan now builds two 
date_dim dynamic-pruning subqueries instead of one. store_sales keeps the 
original `EqualTo(d_year, 2000)` subquery, but web_sales now creates a 
duplicate with `EqualNullSafe(d_year, 2000)`. Subquery-reuse detection treats 
them as different, so catalog_sales reuses the new web_sales subquery and the 
original broadcast is wasted.
   
   
   **Where is the `EqualNullSafe(d_year, 2000)` from?** For the outer query’s 
`ss LEFT JOIN ws ON ws_sold_year = ss_sold_year ...`, the optimizer generates 
an `EqualNullSafe(ws_sold_year, ss_sold_year)` constraint (needed for LEFT JOIN 
null semantics).
   
   The new literal-substitution inference in 
`QueryPlanConstraints.inferAdditionalConstraints` substitutes `ss_sold_year → 
2000` into that `EqualNullSafe`, producing:
   
   ```scala
   EqualNullSafe(ws_sold_year, 2000)
   ```
   
   That inferred predicate is then pushed down through the `ws` CTE and into 
its `date_dim` dynamic-pruning subquery, where it becomes 
`EqualNullSafe(d_year, 2000)`.
   
   The original code never substituted literals into `EqualNullSafe`, so it 
never created this predicate. It only inferred `EqualTo(ws_sold_year, 2000)` 
from the `EqualTo(ws_sold_year, ss_sold_year)` constraint.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to