alamb opened a new issue #1612:
URL: https://github.com/apache/arrow-datafusion/issues/1612
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
We want to bring some extra performance for certain predicates to the
dataframe API users as well.
Logically, a sql query like `SELECT * from A, B where A.a = B.b` can be
planned as a cross join and filter:
```
Filter(a = b)
Join (Cross)
TableScan A
TableScan B
```
Almost no database actually does this because of how expensive a CROSS join
is, and instead uses a specialized "Inner Join" operator.
In DataFusion, The [SQL planner
](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs)
has code to recognize certain predicates such as `a = b` as "join predicates"
However, this logic is not applied to the DataFrame API where one makes a
plan like
```rust
let df = ctx.read_table(A)
.join(ctx.read_table(B), JoinType::Inner, col("a"), col("b"))
```
Which is somewhat less than ergonomic
It would be cool if it were possible to make a plan like
```rust
let df = ctx.read_table(A)
.join(ctx.read_table(B), JoinType::Cross)
.filter(col("a").eq(col("b")))
```
And have the DataFusion optimizer rewrite the CrossJoin into a InnerJoin
**Describe the solution you'd like**
I was thinking it might be possible to extend `FilterPushdown` in a way that
would try and push down `column = column` type expressions from `Filter` nodes
into `Join` nodes. So for example
Rewrite
```
Filter(a = b)
Join (Cross)
TableScan A
TableScan B
```
To
```
Join (Inner) exprs: {a = b}
TableScan A
TableScan B
```
And rewrite
```
Filter(a = b)
Join (Inner) exprs: {a2 = b2}
TableScan A
TableScan B
```
to
```
Filter(a = b)
Join (Inner) exprs: {a2 = b2 AND a = b}
TableScan A
TableScan B
```
**Describe alternatives you've considered**
Maybe this should be its own pass
**Additional context**
Suggested by @houqp . See
https://github.com/apache/arrow-datafusion/pull/1566 and
https://github.com/apache/arrow-datafusion/issues/1566#issuecomment-1015733848
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]