jiangzhx commented on code in PR #5907:
URL: https://github.com/apache/arrow-datafusion/pull/5907#discussion_r1166648579
##########
datafusion/core/tests/sql/subqueries.rs:
##########
@@ -129,12 +129,14 @@ async fn exists_subquery_with_same_table() -> Result<()> {
let expected = vec![
"Explain [plan_type:Utf8, plan:Utf8]",
- " Filter: EXISTS (<subquery>) [t1_id:UInt32;N, t1_name:Utf8;N,
t1_int:UInt32;N]",
- " Subquery: [t1_int:UInt32;N]",
- " Projection: t1.t1_int [t1_int:UInt32;N]",
- " Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N, t1_name:Utf8;N,
t1_int:UInt32;N]",
- " TableScan: t1 [t1_id:UInt32;N, t1_name:Utf8;N,
t1_int:UInt32;N]",
- " TableScan: t1 projection=[t1_id, t1_name, t1_int]
[t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+ " Projection: t1.t1_id, t1.t1_name, t1.t1_int [t1_id:UInt32;N,
t1_name:Utf8;N, t1_int:UInt32;N]",
+ " CrossJoin: [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N,
COUNT(UInt8(1)):Int64;N]",
+ " TableScan: t1 projection=[t1_id, t1_name, t1_int]
[t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+ " SubqueryAlias: __scalar_sq_1 [COUNT(UInt8(1)):Int64;N]",
+ " Filter: COUNT(UInt8(1)) > Int64(0) [COUNT(UInt8(1)):Int64;N]",
+ " Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
[COUNT(UInt8(1)):Int64;N]",
+ " Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N,
t1_int:UInt32;N]",
+ " TableScan: t1 projection=[t1_id, t1_int]
[t1_id:UInt32;N, t1_int:UInt32;N]",
Review Comment:
> I think the classic way to rewrite a subquery like
>
> ```
> WHERE EXISTS (SELECT b FROM t2 where a>1 );
> ```
>
> Is to use a SEMI Join
>
> ```
> SemiJoin ()
> Filter(a > 1)
> TableScan(a, b)
> ```
>
> I don't think it is necessary to actually do an aggregate to count how
many rows -- all that is needed is to check if there is any row
>
> Perhaps @jackwener or @ygf11 have some suggestion on the way to extend
subquery rewrites to support non correlated exists 🤔
I am also confused, should directly from exists to join or let other
optimizer to do the job.
there already a optimizer ScalarSubqueryToJoin can do the job from scalar
subquery to Join.
like spark did
https://github.com/apache/spark/blob/4c938d62d791742b9f0c6a77b66fc06a90d7c0ad/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L62-L72
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]