vanekjar opened a new pull request, #52699:
URL: https://github.com/apache/spark/pull/52699

   ## What changes were proposed in this pull request?
   
   This PR improves the Spark SQL optimizer’s `InferFiltersFromConstraints` 
rule to infer filter conditions from join constraints that involve complex 
expressions, not just simple attribute equalities.
   
   Currently, the optimizer infers additional constraints only when the join 
condition is a simple attribute equality (e.g., `a = b`). For more complex 
expressions, such as arithmetic operations, no corresponding filter is inferred.
   
   ### Example (currently works as expected):
   
   ```sql
   SELECT *
   FROM t1
   JOIN t2 ON t1.a = t2.b
   WHERE t2.b = 1
   ```
   In this case, the optimizer correctly infers the additional constraint `t1.a = 1`.
   
   ### Example (now handled by this PR):
   
   ```sql
   SELECT *
   FROM t1
   JOIN t2 ON t1.a = t2.b + 2
   WHERE t2.b = 1
   ```
   Here, it is clear that `t1.a = 3` (since `t2.b = 1` and `t1.a = t2.b + 2`), 
but previously the optimizer did not infer this constraint. With this change, 
the optimizer deduces `t1.a = 3` and pushes it down.
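   The core idea can be sketched as follows (an illustrative Python sketch, not 
the actual Catalyst implementation; the `Attr`, `Lit`, `Add`, `substitute`, and 
`fold` names are hypothetical): given a constraint binding an attribute to a 
literal, substitute that literal into the join condition's expression tree and 
constant-fold the result to obtain a new filter.

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class Attr:          # a column reference, e.g. t2.b
       name: str

   @dataclass(frozen=True)
   class Lit:           # an integer literal
       value: int

   @dataclass(frozen=True)
   class Add:           # left + right
       left: object
       right: object

   def substitute(expr, bindings):
       """Replace attributes with literals known from existing constraints."""
       if isinstance(expr, Attr):
           return bindings.get(expr.name, expr)
       if isinstance(expr, Add):
           return Add(substitute(expr.left, bindings),
                      substitute(expr.right, bindings))
       return expr

   def fold(expr):
       """Constant-fold: Add(Lit, Lit) -> Lit."""
       if isinstance(expr, Add):
           l, r = fold(expr.left), fold(expr.right)
           if isinstance(l, Lit) and isinstance(r, Lit):
               return Lit(l.value + r.value)
           return Add(l, r)
       return expr

   # Join condition RHS: t2.b + 2; known constraint: t2.b = 1
   rhs = Add(Attr("t2.b"), Lit(2))
   inferred = fold(substitute(rhs, {"t2.b": Lit(1)}))
   print(inferred)   # Lit(value=3), i.e. the new filter t1.a = 3
   ```

   Catalyst works on its own expression trees rather than these toy classes, 
but the substitute-then-fold shape is the same.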
   
   ## How was this patch tested?
   
   You can reproduce and verify the improvement with the following:
   
   ```scala
   spark.sql("CREATE TABLE t1(a INT)")
   spark.sql("CREATE TABLE t2(b INT)")
   
   spark.sql("""
   SELECT * 
   FROM t1 
   INNER JOIN t2 ON t2.b = t1.a + 2 
   WHERE t1.a = 1
   """).explain
   ```
   
   **Before this change, the physical plan does not include the inferred 
filter:**
   ```
   == Physical Plan ==
   AdaptiveSparkPlan
   +- BroadcastHashJoin [(a#2 + 2)], [b#3], Inner, BuildRight, false
      :- Filter (isnotnull(a#2) AND (a#2 = 1))
      :  +- FileScan spark_catalog.default.t1[a#2]
      +- Filter isnotnull(b#3)
         +- FileScan spark_catalog.default.t2[b#3]
   ```
   
   **With this PR, the optimizer infers and pushes down `t2.b = 3` as an 
additional filter:**
   
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- BroadcastHashJoin [(a#2 + 2)], [b#3], Inner, BuildRight, false
      :- Filter (isnotnull(a#2) AND (a#2 = 1))
      :  +- FileScan spark_catalog.default.t1[a#2]
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=27]
         +- Filter ((b#3 = 3) AND isnotnull(b#3))
            +- FileScan spark_catalog.default.t2[b#3]
   ```
   
   ## Why are the changes needed?
   
   Without this enhancement, the optimizer cannot infer and push down filters 
derived from complex join conditions, so larger inputs reach the join and 
query performance suffers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to