aguspdana opened a new issue #1321:
URL: https://github.com/apache/arrow-datafusion/issues/1321


   **Describe the bug**
   
   ```sql
   SELECT *
   FROM left
   LEFT JOIN right
       ON left.a = right.b
   WHERE right.b IS NULL;
   ```
   
   This query returns only rows where `left.a` is `NULL`, which should also 
return any rows where `left.a` don't exist on `right.b`.
   
   
   **To Reproduce**
   
   ```rust
   use std::sync::Arc;
   use datafusion::arrow::array::{Array, Int32Array, StringArray};
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::datasource::memory::MemTable;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() {
       let mut ctx = ExecutionContext::new();
   
       // Create table 1
       let schema = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Int32, true),
       ]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int32Array::from(vec![Some(1), Some(2), None]))]
       ).unwrap();
       let table_a = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
       ctx.register_table("table_a", Arc::new(table_a)).unwrap();
       let dfa = ctx.table("table_a").unwrap();
       println!("table_a");
       dfa.clone().show().await.unwrap();
   
       // Create table 2
       let schema = Arc::new(Schema::new(vec![
           Field::new("b", DataType::Int32, true),
       ]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int32Array::from(vec![Some(1), Some(3), None]))]
       ).unwrap();
       let table_b = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
       ctx.register_table("table_b", Arc::new(table_b)).unwrap();
       let dfb = ctx.table("table_b").unwrap();
       println!("\ntable_b");
       dfb.clone().show().await.unwrap();
   
       println!("\nWith SQL LEFT JOIN WHERE b IS NULL:");
       ctx
           .sql("
               SELECT *
               FROM table_a
               LEFT JOIN table_b 
                   ON table_a.a = table_b.b
               WHERE table_b.b IS NULL;
           ")
           .await
           .unwrap()
           .show()
           .await
           .unwrap();
   
       println!("\nWith DataFrame JoinType::Left and Expr::IsNull(b) filter:");
       dfa.clone()
           .join(
               dfb.clone(),
               JoinType::Left,
               &["a"],
               &["b"]
           )
           .unwrap()
           .filter(col("b").is_null())
           .unwrap()
           .show()
           .await
           .unwrap();
   
       println!("\nWith DataFrame JoinType::Anti:");
       dfa
           .join(
               dfb,
               JoinType::Anti,
               &["a"],
               &["b"]
           )
           .unwrap()
           .show()
           .await
           .unwrap();
   }
   ```
   
   Result:
   
   ```
   table_a
   +---+
   | a |
   +---+
   | 1 |
   | 2 |
   |   |
   +---+
   
   table_b
   +---+
   | b |
   +---+
   | 1 |
   | 3 |
   |   |
   +---+
   
   With SQL LEFT JOIN WHERE b IS NULL:
   +---+---+
   | a | b |
   +---+---+
   |   |   |
   +---+---+
   
   With DataFrame JoinType::Left and Expr::IsNull(b) filter:
   +---+---+
   | a | b |
   +---+---+
   |   |   |
   +---+---+
   
   With DataFrame JoinType::Anti:
   +---+
   | a |
   +---+
   | 2 |
   |   |
   +---+
   ```
   
   
   **Expected behavior**
   
   On PostgreSQL this query returns
   ```
   +---+---+
   | a | b |
   +---+---+
   | 2 |   |
   |   |   |
   +---+---+
   ```
   
   
   **Additional context**
   
   The logical plan of the SQL query:
   
   ```
   Join: #table_a.a = #table_b.b
     Filter: #table_a.a IS NULL
       TableScan: table_a projection=Some([0])
     Filter: #table_b.b IS NULL
       TableScan: table_b projection=Some([0])
   ```
   
   Looking at the logical plan I believe this is due to FilterPushDown (though 
I'm not familiar with the optimizer yet).
   I think we should make an exception for `IS NULL` and `IS NOT NULL` filter 
because `LEFT JOIN` results in `NULL` for every missing value on the join 
column of the right table.
   
   Currently the workaround/prefered way when using the `DataFrame` API is to 
use `JoinType::Semi` and `JoinType::Anti`.
   
   The workaround SQL query is
   ```
   SELECT *                                                                     
                 
   FROM table_a                                                                 
                 
   WHERE EXISTS (
       SELECT 1               
       FROM table_b              
       WHERE table_a.a = table_b.b 
   );
   ```
   
   But the logical plan is not implemented: `NotImplemented("Unsupported ast 
node Exists(Query...`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to