aguspdana opened a new issue #1321:
URL: https://github.com/apache/arrow-datafusion/issues/1321
**Describe the bug**
```sql
SELECT *
FROM left
LEFT JOIN right
ON left.a = right.b
WHERE right.b IS NULL;
```
This query returns only rows where `left.a` is `NULL`, which should also
return any rows where `left.a` don't exist on `right.b`.
**To Reproduce**
```rust
use std::sync::Arc;
use datafusion::arrow::array::{Array, Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::memory::MemTable;
use datafusion::prelude::*;
#[tokio::main]
async fn main() {
let mut ctx = ExecutionContext::new();
// Create table 1
let schema = Arc::new(Schema::new(vec![
Field::new("a", DataType::Int32, true),
]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![Some(1), Some(2), None]))]
).unwrap();
let table_a = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
ctx.register_table("table_a", Arc::new(table_a)).unwrap();
let dfa = ctx.table("table_a").unwrap();
println!("table_a");
dfa.clone().show().await.unwrap();
// Create table 2
let schema = Arc::new(Schema::new(vec![
Field::new("b", DataType::Int32, true),
]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![Some(1), Some(3), None]))]
).unwrap();
let table_b = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
ctx.register_table("table_b", Arc::new(table_b)).unwrap();
let dfb = ctx.table("table_b").unwrap();
println!("\ntable_b");
dfb.clone().show().await.unwrap();
println!("\nWith SQL LEFT JOIN WHERE b IS NULL:");
ctx
.sql("
SELECT *
FROM table_a
LEFT JOIN table_b
ON table_a.a = table_b.b
WHERE table_b.b IS NULL;
")
.await
.unwrap()
.show()
.await
.unwrap();
println!("\nWith DataFrame JoinType::Left and Expr::IsNull(b) filter:");
dfa.clone()
.join(
dfb.clone(),
JoinType::Left,
&["a"],
&["b"]
)
.unwrap()
.filter(col("b").is_null())
.unwrap()
.show()
.await
.unwrap();
println!("\nWith DataFrame JoinType::Anti:");
dfa
.join(
dfb,
JoinType::Anti,
&["a"],
&["b"]
)
.unwrap()
.show()
.await
.unwrap();
}
```
Result:
```
table_a
+---+
| a |
+---+
| 1 |
| 2 |
| |
+---+
table_b
+---+
| b |
+---+
| 1 |
| 3 |
| |
+---+
With SQL LEFT JOIN WHERE b IS NULL:
+---+---+
| a | b |
+---+---+
| | |
+---+---+
With DataFrame JoinType::Left and Expr::IsNull(b) filter:
+---+---+
| a | b |
+---+---+
| | |
+---+---+
With DataFrame JoinType::Anti:
+---+
| a |
+---+
| 2 |
| |
+---+
```
**Expected behavior**
On PostgreSQL this query returns
```
+---+---+
| a | b |
+---+---+
| 2 | |
| | |
+---+---+
```
**Additional context**
The logical plan of the SQL query:
```
Join: #table_a.a = #table_b.b
Filter: #table_a.a IS NULL
TableScan: table_a projection=Some([0])
Filter: #table_b.b IS NULL
TableScan: table_b projection=Some([0])
```
Looking at the logical plan I believe this is due to FilterPushDown (though
I'm not familiar with the optimizer yet).
I think we should make an exception for `IS NULL` and `IS NOT NULL` filter
because `LEFT JOIN` results in `NULL` for every missing value on the join
column of the right table.
Currently the workaround/prefered way when using the `DataFrame` API is to
use `JoinType::Semi` and `JoinType::Anti`.
The workaround SQL query is
```
SELECT *
FROM table_a
WHERE EXISTS (
SELECT 1
FROM table_b
WHERE table_a.a = table_b.b
);
```
But the logical plan is not implemented: `NotImplemented("Unsupported ast
node Exists(Query...`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]