James Xu created SPARK-57437:
--------------------------------
Summary: Infer additional constraints by substituting
attribute=literal bindings
Key: SPARK-57437
URL: https://issues.apache.org/jira/browse/SPARK-57437
Project: Spark
Issue Type: Improvement
Components: Optimizer
Affects Versions: 4.3.0
Reporter: James Xu
When a join ON clause contains a range predicate that references an attribute
from the other side, and that attribute is bound to a literal elsewhere in the
query, Catalyst fails to
derive a pushable filter for the scan.
For example:
```sql
SELECT *
FROM a
LEFT JOIN b
ON a.key = b.key
AND b.pt >= f(a.pt)
WHERE a.pt = '20260610';
```
Because a.pt = '20260610' is known and the join condition contains b.pt >=
f(a.pt), the predicate b.pt >= f('20260610') logically holds for any matching
row from b. However, Catalyst
does not infer this bound, so the filter cannot be pushed into b's scan. As
a result, b is fully scanned even though only a small subset of partitions
could possibly match.
The same pattern occurs with other range or inequality predicates, such as:
```sql
SELECT *
FROM a
JOIN b
ON a.key = b.key
AND b.v >= a.k
AND b.v <= a.k + 10
WHERE a.k = 5;
```
In this case, any matching row from b must satisfy b.v >= 5 AND b.v <= 15,
but again Catalyst does not derive these bounds and scans all of b.
This leads to unnecessary full-table or full-partition scans, significantly
increasing I/O and execution time for workloads that join a large partitioned
table with a filtered
attribute.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]