xiedeyantu opened a new issue, #22369:
URL: https://github.com/apache/datafusion/issues/22369
### Is your feature request related to a problem or challenge?
Currently, DataFusion handles conjunctive join conditions well. For example,
conditions like:
```
a.id = b.id AND a.v > 10
```
can already be optimized by extracting equijoin keys while preserving the
remaining predicates as join filters.
However, for queries such as:
```
SELECT *
FROM t1
INNER JOIN t2
ON t1.id = t2.id OR t1.age = t2.age
```
the presence of an OR condition prevents join key extraction. As a result,
the optimizer typically falls back to a single inner join with a complex filter
condition, making it difficult for later optimization stages and the execution
engine to take advantage of equijoin-based optimizations. Therefore, this type
of join will be executed using nested loop join.
This proposal introduces a new logical optimizer rule that rewrites:
```
t1 INNER JOIN t2 ON cond1 OR cond2 OR ... OR condN
```
into:
```
SELECT * FROM (
(t1 INNER JOIN t2 ON cond1)
UNION ALL
(t1 INNER JOIN t2 ON cond2 AND NOT cond1)
UNION ALL
(t1 INNER JOIN t2 ON cond3 AND NOT cond1 AND NOT cond2)
...
)
```
Each branch becomes mutually exclusive, preserving the original semantics
while exposing simpler join conditions that can be optimized independently.
Initial Scope
The first implementation only targets the following cases:
- Only INNER JOIN
- Only when the join condition contains a top-level splittable OR
- Logical plan rewrite only; no execution-layer changes
- No support for outer joins, semi/anti joins, or full joins
- No join reordering across multi-level join trees; only rewrite the current
join node
Rewrite Example
Input:
```
SELECT *
FROM t1
INNER JOIN t2
ON t1.id = t2.id OR t1.age = t2.age
```
Rewritten as:
```
SELECT *
FROM (
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id
UNION ALL
SELECT * FROM t1 INNER JOIN t2
ON t1.age = t2.age AND NOT (t1.id = t2.id)
)
```
### Describe the solution you'd like
_No response_
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]