[GitHub] [arrow-datafusion] xudong963 commented on pull request #1339: fix: filter pushdown with unsafe null

GitBox Mon, 29 Nov 2021 17:28:47 -0800


xudong963 commented on pull request #1339:
URL: https://github.com/apache/arrow-datafusion/pull/1339#issuecomment-982194847



   > > @xudong963 thank you for the examples
   > > Given this input:
   > > ```sql
   > > ❯ create table table1 as SELECT * FROM (VALUES (1), (2), (null)) as t;
   > > 0 rows in set. Query took 0.006 seconds.
   > > ❯ create table table2 as SELECT * FROM (VALUES (1), (3), (null)) as t;
   > > 0 rows in set. Query took 0.003 seconds.
   > > ```
   > > 
   > > 
   > >     
   > >       
   > >     
   > > 
   > >       
   > >     
   > > 
   > >     
   > >   
   > > The plan that comes out is incorrect:
   > > ```sql
   > > ❯ explain SELECT *
   > > FROM table1
   > > LEFT JOIN table2
   > >     ON table1.column1 = table2.column1
   > > WHERE table2.column1 IS NOT NULL;
   > > 
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
   > > | plan_type     | plan                                                   
                                                                               |
   > > 
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
   > > | logical_plan  | Projection: #table1.column1, #table2.column1           
                                                                               |
   > > |               |   Join: #table1.column1 = #table2.column1              
                                                                               |
   > > |               |     Filter: #table1.column1 IS NOT NULL                
                                                                               |
   > > |               |       TableScan: table1 projection=Some([0])           
                                                                               |
   > > |               |     Filter: #table2.column1 IS NOT NULL                
                                                                               |
   > > |               |       TableScan: table2 projection=Some([0])           
                                                                               |
   > > ```
   > > 
   > > 
   > >     
   > >       
   > >     
   > > 
   > >       
   > >     
   > > 
   > >     
   > >   
   > > I think the problem is this line:
   > > ```
   > > |               |     Filter: #table1.column1 IS NOT NULL      
   > > ```
   > 
   > ```sql
   > postgres=# select * from table1;
   >  column1
   > ---------
   >        1
   >        2
   > 
   > (3 rows)
   > 
   > postgres=# select * from table2;
   >  column1
   > ---------
   >        1
   >        3
   > 
   > (3 rows)
   > 
   > df
   > ❯ select * from table1;
   > +---------+
   > | column1 |
   > +---------+
   > | 1       |
   > | 2       |
   > |         |
   > +---------+
   > 3 rows in set. Query took 0.004 seconds.
   > ❯ select * from table2;
   > +---------+
   > | column1 |
   > +---------+
   > | 1       |
   > | 3       |
   > |         |
   > +---------+
   > 3 rows in set. Query took 0.004 seconds.
   > ```
   > 
   > I changed my code according to your comments
   > 
   > ```rust
   > fn optimize_join(
   >     mut state: State,
   >     plan: &LogicalPlan,
   >     left: &LogicalPlan,
   >     right: &LogicalPlan,
   >     join_type: Option<&JoinType>,
   > ) -> Result<LogicalPlan> {
   >     let (mut pushable_to_left, mut pushable_to_right, keep) =
   >         get_join_predicates(&state, left.schema(), right.schema());
   >     if join_type.is_some() {
   >         match join_type.unwrap() {
   >             JoinType::Left => pushable_to_left.clear(),
   >             JoinType::Right => pushable_to_right.clear(),
   >             JoinType::Full => {
   >                 pushable_to_left.clear();
   >                 pushable_to_right.clear()
   >             }
   >             _ => {}
   >         }
   >     }
   > ```
   > 
   > and got the following result:
   > 
   > ```sql
   > ❯ explain SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NOT NULL;
   > 
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
   > | plan_type     | plan                                                     
                                                                             |
   > 
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
   > | logical_plan  | Projection: #table1.column1, #table2.column1             
                                                                             |
   > |               |   Join: #table1.column1 = #table2.column1                
                                                                             |
   > |               |     TableScan: table1 projection=Some([0])               
                                                                             |
   > |               |     Filter: #table2.column1 IS NOT NULL                  
                                                                             |
   > |               |       TableScan: table2 projection=Some([0])             
                                                                             |
   > | physical_plan | ProjectionExec: expr=[column1@0 as column1, column1@1 as 
column1]                                                                     |
   > |               |   CoalesceBatchesExec: target_batch_size=4096            
                                                                             |
   > |               |     HashJoinExec: mode=Partitioned, join_type=Left, 
on=[(Column { name: "column1", index: 0 }, Column { name: "column1", index: 0 
})] |
   > |               |       CoalesceBatchesExec: target_batch_size=4096        
                                                                             |
   > |               |         RepartitionExec: partitioning=Hash([Column { 
name: "column1", index: 0 }], 8)                                                
 |
   > |               |           MemoryExec: partitions=8, partition_sizes=[1, 
0, 0, 0, 0, 0, 0, 0]                                                          |
   > |               |       CoalesceBatchesExec: target_batch_size=4096        
                                                                             |
   > |               |         RepartitionExec: partitioning=Hash([Column { 
name: "column1", index: 0 }], 8)                                                
 |
   > |               |           CoalesceBatchesExec: target_batch_size=4096    
                                                                             |
   > |               |             FilterExec: column1@0 IS NOT NULL            
                                                                             |
   > |               |               MemoryExec: partitions=8, 
partition_sizes=[1, 0, 0, 0, 0, 0, 0, 0]                                        
              |
   > |               |                                                          
                                                                             |
   > 
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
   > 2 rows in set. Query took 0.013 seconds.
   > ❯ SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = table2.column1 
WHERE table2.column1 IS NOT NULL;
   > +---------+---------+
   > | column1 | column1 |
   > +---------+---------+
   > | 2       |         |
   > |         |         |
   > | 1       | 1       |
   > +---------+---------+
   > 3 rows in set. Query took 0.014 seconds.
   > ```
   > 
   > But postgresql's result is different
   > 
   > ```sql
   > postgres=# explain SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 
= table2.column1 WHERE table2.column1 IS NOT NULL;
   >                               QUERY PLAN
   > ----------------------------------------------------------------------
   >  Merge Join  (cost=358.74..856.63 rows=32347 width=8)
   >    Merge Cond: (table2.column1 = table1.column1)
   >    ->  Sort  (cost=178.95..185.30 rows=2537 width=4)
   >          Sort Key: table2.column1
   >          ->  Seq Scan on table2  (cost=0.00..35.50 rows=2537 width=4)
   >                Filter: (column1 IS NOT NULL)
   >    ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
   >          Sort Key: table1.column1
   >          ->  Seq Scan on table1  (cost=0.00..35.50 rows=2550 width=4)
   > (9 rows)
   > 
   > postgres=# SELECT * FROM table1 LEFT JOIN table2 ON table1.column1 = 
table2.column1 WHERE table2.column1 IS NOT NULL
   > postgres-# ;
   >  column1 | column1
   > ---------+---------
   >        1 |       1
   > (1 row)
   > ```
   
   I guess 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] xudong963 commented on pull request #1339: fix: filter pushdown with unsafe null

Reply via email to