james727 opened a new pull request #1618: URL: https://github.com/apache/arrow-datafusion/pull/1618
# Which issue does this PR close? Closes https://github.com/apache/arrow-datafusion/issues/1586 # What changes are included in this PR? This change does a couple of things: 1. Does not attempt to push down any predicates for the nullable (non-preserved) side of a given join 2. Bails on the optimization to duplicate post-join filters on a column involved in the join equality to the other side of the join, if the other side of the join is nullable. I'll elaborate on point 2 here as I think the above is a bit difficult to parse. Consider the following query: ```sql SELECT * FROM t1 JOIN t2 on t1.id = t2.uid WHERE t1.id > 1 ``` Right now this is being rewritten as: ```sql SELECT * FROM t1 JOIN t2 on t1.id = t2.uid WHERE t1.id > 1 AND t2.uid > 1 -- Duplicate filter to t2.uid, so it can be pushed down ``` This is correct in the case of certain joins. However, I believe that in cases where the other join side is non-preserved, we should not apply this optimization. For example, if we change the above join to a `LEFT` join, the query rewrite becomes: ```sql SELECT * FROM t1 LEFT JOIN t2 on t1.id = t2.uid WHERE t1.id > 1 AND t2.uid > 1 ``` The additional filter on `t2.uid` is wrong - as it omits null rows that would have been present in the original query. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
