xudong963 commented on a change in pull request #1339:
URL: https://github.com/apache/arrow-datafusion/pull/1339#discussion_r753654701



##########
File path: datafusion/src/optimizer/filter_push_down.rs
##########
@@ -408,10 +412,20 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> 
Result<LogicalPlan> {
             //
             // Join clauses with `Using` constraints also take advantage of 
this logic to make sure
             // predicates reference the shared join columns are pushed to both 
sides.
+            let mut null_unsafe_predicates = vec![];
             let join_side_filters = state
                 .filters
                 .iter()
                 .filter_map(|(predicate, columns)| {
+                    if join_type == &JoinType::Left
+                        || join_type == &JoinType::Right
+                        || join_type == &JoinType::Full
+                    {
+                        if let Expr::IsNull(..) | Expr::IsNotNull(..) = 
predicate {

Review comment:
       > I don't think we can push _ANY_ filters from the `ON` clause down the 
non-preserved side(s) of an outer join -- in other words, this is not a problem 
with `IS NULL` and `IS NOT NULL`.
   
   Yes, I agree. Luckily, we don't implement the case. 
https://github.com/apache/arrow-datafusion/blob/0facd4d483e8c289ee4e3a89487d0cd1ede1a110/datafusion/src/sql/planner.rs#L564
   
   However, I think there is a problem with `IS NULL` and `IS NOT NULL` in the 
`WHERE` clause; we shouldn't push them down.

##########
File path: datafusion/tests/sql.rs
##########
@@ -5999,3 +5999,86 @@ async fn test_expect_distinct() -> Result<()> {
     assert_batches_eq!(expected, &actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn test_predicate_push_down_with_unsafe_null() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    let create_table1_sql =
+        "create table table1 as SELECT * FROM (VALUES (1), (2), (null)) as t";
+    ctx.sql(create_table1_sql).await.unwrap();
+    let create_table2_sql =
+        "create table table2 as SELECT * FROM (VALUES (1), (3), (null)) as t";
+    ctx.sql(create_table2_sql).await.unwrap();
+    // left join with is_not_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NOT NULL";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "| 1       | 1       |",
+        "+---------+---------+",
+    ];
+    assert_batches_eq!(expected, &actual);
+    // left join with is_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NULL ORDER BY table1.column1";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "|         |         |",
+        "| 2       |         |",
+        "+---------+---------+",

Review comment:
       This is because DF currently defaults to nulls-first ordering to be consistent 
with Spark. I'll open a pull request to make it consistent with PostgreSQL. 
   
   FYI, pg rule: https://www.postgresql.org/docs/current/queries-order.html
   > The NULLS FIRST and NULLS LAST options can be used to determine whether 
nulls appear before or after non-null values in the sort ordering. By default, 
null values sort as if larger than any non-null value; that is, NULLS FIRST is 
the default for DESC order, and NULLS LAST otherwise.

##########
File path: datafusion/tests/sql.rs
##########
@@ -5999,3 +5999,86 @@ async fn test_expect_distinct() -> Result<()> {
     assert_batches_eq!(expected, &actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn test_predicate_push_down_with_unsafe_null() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    let create_table1_sql =
+        "create table table1 as SELECT * FROM (VALUES (1), (2), (null)) as t";
+    ctx.sql(create_table1_sql).await.unwrap();
+    let create_table2_sql =
+        "create table table2 as SELECT * FROM (VALUES (1), (3), (null)) as t";
+    ctx.sql(create_table2_sql).await.unwrap();
+    // left join with is_not_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NOT NULL";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "| 1       | 1       |",
+        "+---------+---------+",
+    ];
+    assert_batches_eq!(expected, &actual);
+    // left join with is_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NULL ORDER BY table1.column1";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "|         |         |",
+        "| 2       |         |",
+        "+---------+---------+",

Review comment:
       This is because DF currently defaults to nulls-first ordering to be consistent 
with Spark. I'll open a pull request to make it consistent with PostgreSQL. 
   
   FYI, pg rule: https://www.postgresql.org/docs/current/queries-order.html
   > The NULLS FIRST and NULLS LAST options can be used to determine whether 
nulls appear before or after non-null values in the sort ordering. By default, 
null values sort as if larger than any non-null value; that is, NULLS FIRST is 
the default for DESC order, and NULLS LAST otherwise.

##########
File path: datafusion/src/optimizer/filter_push_down.rs
##########
@@ -408,10 +412,20 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> 
Result<LogicalPlan> {
             //
             // Join clauses with `Using` constraints also take advantage of 
this logic to make sure
             // predicates reference the shared join columns are pushed to both 
sides.
+            let mut null_unsafe_predicates = vec![];
             let join_side_filters = state
                 .filters
                 .iter()
                 .filter_map(|(predicate, columns)| {
+                    if join_type == &JoinType::Left
+                        || join_type == &JoinType::Right
+                        || join_type == &JoinType::Full
+                    {
+                        if let Expr::IsNull(..) | Expr::IsNotNull(..) = 
predicate {

Review comment:
       > I don't think we can push _ANY_ filters from the `ON` clause down the 
non-preserved side(s) of an outer join -- in other words, this is not a problem 
with `IS NULL` and `IS NOT NULL`.
   
   Yes, I agree. Luckily, we don't implement the case. 
https://github.com/apache/arrow-datafusion/blob/0facd4d483e8c289ee4e3a89487d0cd1ede1a110/datafusion/src/sql/planner.rs#L564
   
   However, I think there is a problem with `IS NULL` and `IS NOT NULL` in the 
`WHERE` clause with an outer join; we shouldn't push them down.

##########
File path: datafusion/tests/sql.rs
##########
@@ -5999,3 +5999,86 @@ async fn test_expect_distinct() -> Result<()> {
     assert_batches_eq!(expected, &actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn test_predicate_push_down_with_unsafe_null() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    let create_table1_sql =
+        "create table table1 as SELECT * FROM (VALUES (1), (2), (null)) as t";
+    ctx.sql(create_table1_sql).await.unwrap();
+    let create_table2_sql =
+        "create table table2 as SELECT * FROM (VALUES (1), (3), (null)) as t";
+    ctx.sql(create_table2_sql).await.unwrap();
+    // left join with is_not_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NOT NULL";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "| 1       | 1       |",
+        "+---------+---------+",
+    ];
+    assert_batches_eq!(expected, &actual);
+    // left join with is_null filter
+    let sql = "SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NULL ORDER BY table1.column1";
+    let actual = execute_to_batches(&mut ctx, sql).await;
+    let expected = vec![
+        "+---------+---------+",
+        "| column1 | column1 |",
+        "+---------+---------+",
+        "|         |         |",
+        "| 2       |         |",
+        "+---------+---------+",

Review comment:
       > I'll open a pull request to make it consistent with PostgreSQL.
   
   Here: https://github.com/apache/arrow-datafusion/pull/1344

##########
File path: datafusion/src/optimizer/filter_push_down.rs
##########
@@ -408,10 +412,20 @@ fn optimize(plan: &LogicalPlan, mut state: State) -> 
Result<LogicalPlan> {
             //
             // Join clauses with `Using` constraints also take advantage of 
this logic to make sure
             // predicates reference the shared join columns are pushed to both 
sides.
+            let mut null_unsafe_predicates = vec![];
             let join_side_filters = state
                 .filters
                 .iter()
                 .filter_map(|(predicate, columns)| {
+                    if join_type == &JoinType::Left
+                        || join_type == &JoinType::Right
+                        || join_type == &JoinType::Full
+                    {
+                        if let Expr::IsNull(..) | Expr::IsNotNull(..) = 
predicate {

Review comment:
       > I don't think we can push _ANY_ filters from the `ON` clause down the 
non-preserved side(s) of an outer join -- in other words, this is not a problem 
with `IS NULL` and `IS NOT NULL`.
   
   Yes, I agree. Luckily, we don't implement the case. 
https://github.com/apache/arrow-datafusion/blob/0facd4d483e8c289ee4e3a89487d0cd1ede1a110/datafusion/src/sql/planner.rs#L564
   
   However, I think there is a problem with `IS NULL` and `IS NOT NULL` in the 
`WHERE` clause with an outer join; we shouldn't push them down. I think this 
ticket solves that problem.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to