alamb opened a new issue #1612:
URL: https://github.com/apache/arrow-datafusion/issues/1612


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   We want to bring some extra performance for certain predicates to the 
dataframe API users as well. 
   
   Logically, a sql query like `SELECT * from A, B where A.a = B.b` can be 
planned as a cross join and filter:
   
   ```
   Filter(a = b)
     Join (Cross)
       TableScan A
       TableScan B
   ```
   
   Almost no database actually does this because of how expensive a CROSS join 
is, and instead uses a specialized "Inner Join" operator.
   
   In DataFusion, The [SQL planner 
](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs)
 has code to recognize certain predicates such as `a = b` as "join predicates" 
   
   However, this logic is not applied to the DataFrame API where one makes a 
plan like
   ```rust
   let df = ctx.read_table(A)
       .join(ctx.read_table(B), JoinType::Inner, col("a"), col("b"))
   ```
   Which is somewhat less than ergonomic
   
   It would be cool if it were possible to make a plan like
   ```rust
   let df = ctx.read_table(A)
       .join(ctx.read_table(B), JoinType::Cross)
       .filter(col("a").eq(col("b")))
   ```
   
   And have the DataFusion optimizer rewrite the CrossJoin into a InnerJoin
   
   
   
   
   **Describe the solution you'd like**
   
   I was thinking it might be possible to extend `FilterPushdown` in a way that 
would try and push down `column = column` type expressions from `Filter` nodes 
into `Join` nodes. So for example
   
   Rewrite
   ```
   Filter(a = b)
     Join (Cross)
       TableScan A
       TableScan B
   ```
   To 
   ```
     Join (Inner) exprs: {a = b}
       TableScan A
       TableScan B
   ```
   
   And rewrite 
   ```
   Filter(a = b)
     Join (Inner) exprs: {a2 = b2}
       TableScan A
       TableScan B
   ```
   to
   ```
   Filter(a = b)
     Join (Inner) exprs: {a2 = b2 AND a = b}
       TableScan A
       TableScan B
   ```
   
   **Describe alternatives you've considered**
   Maybe this should be its own pass
   
   **Additional context**
   Suggested by @houqp . See 
https://github.com/apache/arrow-datafusion/pull/1566 and  
https://github.com/apache/arrow-datafusion/issues/1566#issuecomment-1015733848


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to