[GitHub] [arrow-datafusion] jiangzhx commented on a diff in pull request #5907: optimize non-correlated where exists subquery rewrite to scalar subquery

via GitHub Fri, 14 Apr 2023 03:18:04 -0700


jiangzhx commented on code in PR #5907:
URL: https://github.com/apache/arrow-datafusion/pull/5907#discussion_r1166648579



##########
datafusion/core/tests/sql/subqueries.rs:
##########
@@ -129,12 +129,14 @@ async fn exists_subquery_with_same_table() -> Result<()> {
 
     let expected = vec![
         "Explain [plan_type:Utf8, plan:Utf8]",
-        "  Filter: EXISTS (<subquery>) [t1_id:UInt32;N, t1_name:Utf8;N, 
t1_int:UInt32;N]",
-        "    Subquery: [t1_int:UInt32;N]",
-        "      Projection: t1.t1_int [t1_int:UInt32;N]",
-        "        Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N, t1_name:Utf8;N, 
t1_int:UInt32;N]",
-        "          TableScan: t1 [t1_id:UInt32;N, t1_name:Utf8;N, 
t1_int:UInt32;N]",
-        "    TableScan: t1 projection=[t1_id, t1_name, t1_int] 
[t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "  Projection: t1.t1_id, t1.t1_name, t1.t1_int [t1_id:UInt32;N, 
t1_name:Utf8;N, t1_int:UInt32;N]",
+        "    CrossJoin: [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, 
COUNT(UInt8(1)):Int64;N]",
+        "      TableScan: t1 projection=[t1_id, t1_name, t1_int] 
[t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "      SubqueryAlias: __scalar_sq_1 [COUNT(UInt8(1)):Int64;N]",
+        "        Filter: COUNT(UInt8(1)) > Int64(0) [COUNT(UInt8(1)):Int64;N]",
+        "          Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]] 
[COUNT(UInt8(1)):Int64;N]",
+        "            Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N, 
t1_int:UInt32;N]",
+        "              TableScan: t1 projection=[t1_id, t1_int] 
[t1_id:UInt32;N, t1_int:UInt32;N]",

Review Comment:
   > I think the classic way to rewrite a subquery like
   > 
   > ```
   > WHERE  EXISTS (SELECT b FROM t2 where a>1 );
   > ```
   > 
   > Is to use a SEMI Join
   > 
   > ```
   > SemiJoin ()
   >   Filter(a > 1)
   >     TableScan(a, b)
   > ```
   > 
   > I don't think it is necessary to actually do an aggregate to count how 
many rows -- all that is needed is to check if there is any row
   > 
   > Perhaps @jackwener or @ygf11 have some suggestion on the way to extend 
subquery rewrites to support non correlated exists 🤔
   
   I am also confused, should directly from exists to join or let other 
optimizer to do the job.
   there already a optimizer ScalarSubqueryToJoin can do the job from scalar 
subquery to Join.
   
   like spark did
   
https://github.com/apache/spark/blob/4c938d62d791742b9f0c6a77b66fc06a90d7c0ad/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L62-L72
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] jiangzhx commented on a diff in pull request #5907: optimize non-correlated where exists subquery rewrite to scalar subquery

Reply via email to