James Xu created SPARK-57437:
--------------------------------

             Summary: Infer additional constraints by substituting 
attribute=literal bindings
                 Key: SPARK-57437
                 URL: https://issues.apache.org/jira/browse/SPARK-57437
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 4.3.0
            Reporter: James Xu


When a join ON clause contains a range predicate that references an attribute 
from the other side, and that attribute is bound to a literal elsewhere in the 
query, Catalyst fails to
   derive a pushable filter for the scan.

   For example:

   ```sql
     SELECT *
     FROM a
     LEFT JOIN b
       ON a.key = b.key
      AND b.pt >= f(a.pt)
     WHERE a.pt = '20260610';
   ```

   Because a.pt = '20260610' is known and the join condition contains b.pt >= 
f(a.pt), the predicate b.pt >= f('20260610') logically holds for any matching 
row from b. However, Catalyst
   does not infer this bound, so the filter cannot be pushed into b's scan. As 
a result, b is fully scanned even though only a small subset of partitions 
could possibly match.

   The same pattern occurs with other range or inequality predicates, such as:

   ```sql
     SELECT *
     FROM a
     JOIN b
       ON a.key = b.key
      AND b.v >= a.k
      AND b.v <= a.k + 10
     WHERE a.k = 5;
   ```

   In this case, any matching row from b must satisfy b.v >= 5 AND b.v <= 15, 
but again Catalyst does not derive these bounds and scans all of b.

   This leads to unnecessary full-table or full-partition scans, significantly 
increasing I/O and execution time for workloads that join a large partitioned 
table with a filtered
   attribute.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to