james727 commented on a change in pull request #1618:
URL: https://github.com/apache/arrow-datafusion/pull/1618#discussion_r789951842
##########
File path: datafusion/src/optimizer/filter_push_down.rs
##########
@@ -1139,6 +1136,172 @@ mod tests {
Ok(())
}
+ /// post-join predicates on the right side of a left join are duplicated
and pushed down
+ /// to the left side.
+ #[test]
+ fn filter_using_left_join() -> Result<()> {
+ let table_scan = test_table_scan()?;
+ let left = LogicalPlanBuilder::from(table_scan).build()?;
+ let right_table_scan = test_table_scan_with_name("test2")?;
+ let right = LogicalPlanBuilder::from(right_table_scan)
+ .project(vec![col("a")])?
+ .build()?;
+ let plan = LogicalPlanBuilder::from(left)
+ .join_using(
+ &right,
+ JoinType::Left,
+ vec![Column::from_name("a".to_string())],
+ )?
+ .filter(col("test2.a").lt_eq(lit(1i64)))?
+ .build()?;
+
+ // not part of the test, just good to know:
+ assert_eq!(
+ format!("{:?}", plan),
+ "\
+ Filter: #test2.a <= Int64(1)\
+ \n Join: Using #test.a = #test2.a\
+ \n TableScan: test projection=None\
+ \n Projection: #test2.a\
+ \n TableScan: test2 projection=None"
+ );
+
+ // filter duplicated and pushed down to the left
+ let expected = "\
+ Filter: #test2.a <= Int64(1)\
+ \n Join: Using #test.a = #test2.a\
+ \n Filter: #test.a <= Int64(1)\
Review comment:
Oh man, great catch! I think this gets back to some previous discussion
around filters that preserve nulls vs filters that do not preserve nulls. It
seems this filter duplication logic is valid for filters that do not preserve
nulls (e.g. - in this test case), but breaks when the filter preserves nulls
(as shown in your example above).
E.g. - correct me if I'm wrong I _think_ that duplicating the filter is
correct in the following query:
```
SELECT *
FROM t1 LEFT JOIN t2
ON t1.x = t2.y
WHERE t2.y > 5 -- Can duplicate filter for t1.x > 5
```
What do you think about going back to the logic in [this
commit](https://github.com/apache/arrow-datafusion/pull/1618/commits/c80772b5ce2c1a6419cefb247d55ec9ef2ef9dbd)
that just skips this optimization for non-inner joins (before I got fancy with
it). If a filter does not preserve nulls, we will be able to rewrite the join
as an `INNER` join anyways, so once that optimization is implemented we'll get
the correct duplication logic with another pass of this optimizer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]