james727 opened a new pull request #1618:
URL: https://github.com/apache/arrow-datafusion/pull/1618


   # Which issue does this PR close?
   Closes https://github.com/apache/arrow-datafusion/issues/1586
   
   # What changes are included in this PR?
   This change does a couple of things:
   1. Does not attempt to push down any predicates for the nullable 
(non-preserved) side of a given join
   2. Bails on the optimization to duplicate post-join filters on a column 
involved in the join equality to the other side of the join, if the other side 
of the join is nullable.
   
   I'll elaborate on point 2 here as I think the above is a bit difficult to 
parse. Consider the following query:
   ```sql
   SELECT *
   FROM t1 JOIN t2 
     on t1.id = t2.uid 
   WHERE t1.id > 1
   ```
   Right now this is being rewritten as:
   ```sql
   SELECT *
   FROM t1 JOIN t2 
     on t1.id = t2.uid 
   WHERE t1.id > 1 AND t2.uid > 1 -- Duplicate filter to t2.uid, so it can be 
pushed down
   ```
   This is correct in the case of certain joins. However, I believe that in 
cases where the other join side is non-preserved, we should not apply this 
optimization. For example, if we change the above join to a `LEFT` join, the 
query rewrite becomes:
   ```sql
   SELECT *
   FROM t1 LEFT JOIN t2 
     on t1.id = t2.uid 
   WHERE t1.id > 1 AND t2.uid > 1
   ```
   The additional filter on `t2.uid` is wrong - as it omits null rows that 
would have been present in the original query.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to