vanekjar opened a new pull request, #52699:
URL: https://github.com/apache/spark/pull/52699
## What changes were proposed in this pull request?
This PR improves the Spark SQL optimizer’s `InferFiltersFromConstraints`
rule to infer filter conditions from join constraints that involve complex
expressions, not just simple attribute equalities.
Currently, the optimizer can only infer additional constraints when the join
condition is a simple equality (e.g., `a = b`). For more complex expressions,
such as arithmetic operations, it does not infer the corresponding filter.
### Example (currently works as expected):
```sql
SELECT *
FROM t1
JOIN t2 ON t1.a = t2.b
WHERE t2.b = 1
```
In this case, the optimizer correctly infers the additional constraint `t1.a = 1`.
### Example (now handled by this PR):
```sql
SELECT *
FROM t1
JOIN t2 ON t1.a = t2.b + 2
WHERE t2.b = 1
```
Here, it is clear that `t1.a = 3` (since `t2.b = 1` and `t1.a = t2.b + 2`),
but previously the optimizer did not infer this constraint. With this change,
the optimizer can now deduce and push down `t1.a = 3`.
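The inference above amounts to substituting a known equality constraint into the complex side of the join condition and constant-folding the result. The following is a minimal, hypothetical Python sketch of that idea (it is not Spark's actual Catalyst implementation; the tuple-based expression encoding and helper names are invented for illustration):

```python
# Hypothetical sketch of constraint substitution + constant folding,
# mirroring the PR's example: given t2.b = 1 and t1.a = t2.b + 2,
# derive t1.a = 3. Expressions are nested tuples:
# ("attr", name), ("lit", value), or ("add", left, right).

def substitute(expr, bindings):
    """Replace attribute references with known literal values."""
    kind = expr[0]
    if kind == "attr":
        name = expr[1]
        return ("lit", bindings[name]) if name in bindings else expr
    if kind == "add":
        return ("add", substitute(expr[1], bindings), substitute(expr[2], bindings))
    return expr

def fold(expr):
    """Constant-fold subtrees whose children are all literals."""
    if expr[0] == "add":
        left, right = fold(expr[1]), fold(expr[2])
        if left[0] == "lit" and right[0] == "lit":
            return ("lit", left[1] + right[1])
        return ("add", left, right)
    return expr

# Join condition right-hand side: t2.b + 2; known constraint: t2.b = 1.
join_rhs = ("add", ("attr", "t2.b"), ("lit", 2))
inferred = fold(substitute(join_rhs, {"t2.b": 1}))
print(inferred)  # ("lit", 3) -> the new filter t1.a = 3 can be pushed down
```

Once the complex side folds to a literal, the equality join condition lets the optimizer attach that literal as a filter on the other side of the join.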
## How was this patch tested?
You can reproduce and verify the improvement with the following:
```scala
spark.sql("CREATE TABLE t1(a INT)")
spark.sql("CREATE TABLE t2(b INT)")
spark.sql("""
SELECT *
FROM t1
INNER JOIN t2 ON t2.b = t1.a + 2
WHERE t1.a = 1
""").explain
```
**Before this change, the physical plan does not include the inferred
filter:**
```
== Physical Plan ==
AdaptiveSparkPlan
+- BroadcastHashJoin [(a#2 + 2)], [b#3], Inner, BuildRight, false
:- Filter (isnotnull(a#2) AND (a#2 = 1))
: +- FileScan spark_catalog.default.t1[a#2]
+- Filter isnotnull(b#3)
+- FileScan spark_catalog.default.t2[b#3]
```
**With this PR, the optimizer infers and pushes down `t2.b = 3` as an additional filter:**
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [(a#2 + 2)], [b#3], Inner, BuildRight, false
:- Filter (isnotnull(a#2) AND (a#2 = 1))
: +- FileScan spark_catalog.default.t1[a#2]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int,
false] as bigint)),false), [plan_id=27]
+- Filter ((b#3 = 3) AND isnotnull(b#3))
+- FileScan spark_catalog.default.t2[b#3]
```
## Why are the changes needed?
Without this enhancement, the optimizer cannot infer and push down filters for joins whose conditions involve complex expressions, so such joins read and shuffle more data than necessary, which can lead to suboptimal join performance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]