[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5345: Refactor DecorrelateWhereExists and add back Distinct if needs

via GitHub Thu, 09 Mar 2023 20:14:44 -0800


mingmwang commented on code in PR #5345:
URL: https://github.com/apache/arrow-datafusion/pull/5345#discussion_r1131929489



##########
datafusion/optimizer/src/decorrelate_where_exists.rs:
##########
@@ -670,4 +673,76 @@ mod tests {
 
         assert_plan_eq(&plan, expected)
     }
+
+    #[test]
+    fn exists_distinct_subquery() -> Result<()> {
+        let table_scan = test_table_scan()?;
+        let subquery_scan = test_table_scan_with_name("sq")?;
+        let subquery = LogicalPlanBuilder::from(subquery_scan)
+            .filter((lit(1u32) + col("sq.a")).gt(col("test.a") * lit(2u32)))?
+            .project(vec![col("sq.c")])?
+            .distinct()?
+            .build()?;
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .filter(exists(Arc::new(subquery)))?
+            .project(vec![col("test.b")])?
+            .build()?;
+
+        let expected = "Projection: test.b [b:UInt32]\
+                        \n  LeftSemi Join:  Filter: UInt32(1) + sq.a > test.a 
* UInt32(2) [a:UInt32, b:UInt32, c:UInt32]\
+                        \n    TableScan: test [a:UInt32, b:UInt32, c:UInt32]\
+                        \n    Distinct: [a:UInt32]\
+                        \n      Projection: sq.a [a:UInt32]\
+                        \n        TableScan: sq [a:UInt32, b:UInt32, 
c:UInt32]";
+
+        assert_plan_eq(&plan, expected)
+    }
+
+    #[test]
+    fn exists_distinct_expr_subquery() -> Result<()> {
+        let table_scan = test_table_scan()?;
+        let subquery_scan = test_table_scan_with_name("sq")?;
+        let subquery = LogicalPlanBuilder::from(subquery_scan)
+            .filter((lit(1u32) + col("sq.a")).gt(col("test.a") * lit(2u32)))?
+            .project(vec![col("sq.b") + col("sq.c")])?
+            .distinct()?
+            .build()?;
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .filter(exists(Arc::new(subquery)))?
+            .project(vec![col("test.b")])?
+            .build()?;
+
+        let expected = "Projection: test.b [b:UInt32]\
+                        \n  LeftSemi Join:  Filter: UInt32(1) + sq.a > test.a 
* UInt32(2) [a:UInt32, b:UInt32, c:UInt32]\
+                        \n    TableScan: test [a:UInt32, b:UInt32, c:UInt32]\
+                        \n    Distinct: [a:UInt32]\
+                        \n      Projection: sq.a [a:UInt32]\
+                        \n        TableScan: sq [a:UInt32, b:UInt32, 
c:UInt32]";
+
+        assert_plan_eq(&plan, expected)
+    }

Review Comment:
   > I find the behaviors of `Distinct` in postgres and spark are not same.
   > 
   > * For postgres, it will not add back the distinct to the optimized result.
   > * For spark, it will add back the distinct, and will keep the unused 
project exprs(`sq.b + sq.c` in the above).
   > 
   
   @ygf11 @jackwener @alamb 
   I guess why Postgres remove the `Distinct` in the final result is maybe it 
is related to how it implements NAAJ(Null Aware Anti Join) and to make sure the 
result is correct.
   
   You can test try those SQLs in Postgres:
   
   ```
   CREATE TABLE t1 (id integer,name varchar(100));
   CREATE TABLE t2 (id integer,name varchar(100));
   ```
   ---------------------------------
   ```
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT distinct * FROM t2 WHERE 
t2.id = t1.id);
   ```
   
   ```
   QUERY PLAN
   Hash Join (cost=18.50..34.32 rows=160 width=222)
   Hash Cond: (t1.id = t2.id)
   -> Seq Scan on t1 (cost=0.00..13.20 rows=320 width=222)
   -> Hash (cost=16.00..16.00 rows=200 width=4)
   -> HashAggregate (cost=14.00..16.00 rows=200 width=4)
   Group Key: t2.id
   -> Seq Scan on t2 (cost=0.00..13.20 rows=320 width=4)
   ```
   ---------------------------------
   ```
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE NOT EXISTS (SELECT distinct t2.name FROM 
t2 WHERE t2.id = t1.id);
   ```
   
   ```
   QUERY PLAN
   --
   Hash Anti Join (cost=17.20..35.82 rows=160 width=222)
   Hash Cond: (t1.id = t2.id)
   -> Seq Scan on t1 (cost=0.00..13.20 rows=320 width=222)
   -> Hash (cost=13.20..13.20 rows=320 width=4)
   -> Seq Scan on t2 (cost=0.00..13.20 rows=320 width=4)
   ```
   ---------------------------------
   ```
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT distinct t2.id FROM t2);
   ```
   
   ```
   QUERY PLAN
   --
   Hash Join (cost=20.50..34.56 rows=320 width=222)
   Hash Cond: (t1.id = t2.id)
   -> Seq Scan on t1 (cost=0.00..13.20 rows=320 width=222)
   -> Hash (cost=18.00..18.00 rows=200 width=4)
   -> HashAggregate (cost=14.00..16.00 rows=200 width=4)
   Group Key: t2.id
   -> Seq Scan on t2 (cost=0.00..13.20 rows=320 width=4)
   ```
   ---------------------------------
   ```
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE t1.id not in (SELECT distinct t2.id FROM 
t2);
   ```
   
   ```
   QUERY PLAN
   --
   Seq Scan on t1 (cost=16.50..30.50 rows=160 width=222)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
   -> HashAggregate (cost=14.00..16.00 rows=200 width=4)
   Group Key: t2.id
   -> Seq Scan on t2 (cost=0.00..13.20 rows=320 width=4
   
   ```
   
   You can see that for `IN`/`EXISTS` subqueries, there are `Aggregates`. But 
for `NOT EXISTS`, the Aggregate` is removed, and for `NOT IN`, the subquery is 
kept and not decorrelated.  
   
   I think in DataFusion, I like the implementation in this PR and the 
generated plan is more consistent. For the correctness of NAAJ,  we have 
another ticket to track this and we can make the Hash Join itself Null aware.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5345: Refactor DecorrelateWhereExists and add back Distinct if needs

Reply via email to