francis0407 commented on issue #23783: [SPARK-26854][SQL] Support ANY/SOME 
subquery
URL: https://github.com/apache/spark/pull/23783#issuecomment-477161182
 
 
   Hi @dongjoon-hyun ,
   I reconsidered your suggestion and ran some tests.
   Semantically, every predicate subquery (`IN`, `ANY`, `ALL`) can be transformed into an `EXISTS` form:
   ```
   a in (select b from t) ->  exists(select b from t where a = b)
   
   a > any (select b from t) -> exists(select b from t where a > b)
   
   a > all (select b from t) -> not exists(select b from t where not(a > b))
   ```
   This is genuinely useful, because after the transformation we only need to handle the `EXISTS` case.
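   The claimed equivalences can be brute-force checked over small in-memory inputs. Here is a quick Python sketch (a toy model over plain ints, deliberately ignoring SQL NULL / three-valued semantics; the names `exists`, `in_pred`, etc. are illustrative, not Spark APIs):

   ```python
   from itertools import product

   def exists(rows, pred):
       # EXISTS over the subquery rows, with the correlated filter `pred`
       return any(pred(b) for b in rows)

   # The three predicates being rewritten, modeled over plain ints:
   def in_pred(a, t):  return a in t                 # a IN (SELECT b FROM t)
   def any_pred(a, t): return any(a > b for b in t)  # a > ANY (SELECT b FROM t)
   def all_pred(a, t): return all(a > b for b in t)  # a > ALL (SELECT b FROM t)

   # Brute-force check of the three rewrites on all small inputs
   values = range(-2, 3)
   tables = [list(c) for n in range(3) for c in product(values, repeat=n)]
   for a, t in product(values, tables):
       assert in_pred(a, t)  == exists(t, lambda b: a == b)
       assert any_pred(a, t) == exists(t, lambda b: a > b)
       assert all_pred(a, t) == (not exists(t, lambda b: not (a > b)))
   print("all rewrites agree")
   ```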
   
   Currently, however, this rewrite runs into a problem. Spark supports this query:
   ```
   scala> spark.sql(
        | """
        | select *
        | from a
        | where exists (select *
        |               from b
        |               where a1 in (select c1
        |                            from c))
        | """).explain(true)
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter exists#15 []
      :  +- 'Project [*]
      :     +- 'Filter 'a1 IN (list#14 [])
      :        :  +- 'Project ['c1]
      :        :     +- 'UnresolvedRelation `c`
      :        +- 'UnresolvedRelation `b`
      +- 'UnresolvedRelation `a`
   
   == Analyzed Logical Plan ==
   a1: int, a2: int
   Project [a1#2, a2#3]
   +- Filter exists#15 [a1#2]
      :  +- Project [b1#4, b2#5]
      :     +- Filter outer(a1#2) IN (list#14 [])
      :        :  +- Project [c1#6]
      :        :     +- SubqueryAlias `default`.`c`
      :        :        +- Relation[c1#6,c2#7] parquet
      :        +- SubqueryAlias `default`.`b`
      :           +- Relation[b1#4,b2#5] parquet
      +- SubqueryAlias `default`.`a`
         +- Relation[a1#2,a2#3] parquet
   
   == Optimized Logical Plan ==
   Join LeftSemi, exists#19: boolean
   :- Join ExistenceJoin(exists#19), (a1#2 = c1#6)
   :  :- Relation[a1#2,a2#3] parquet
   :  +- Project [c1#6]
   :     +- Relation[c1#6,c2#7] parquet
   +- Project
      +- Relation[b1#4,b2#5] parquet
   
   ```
   But not this one:
   ```
   scala> spark.sql(
        | """
        | select *
        | from a
        | where exists(select *
        |              from b
        |              where exists(select *
        |                           from c
        |                           where a1 = c1))
        | """).explain(true)
   org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input 
columns: [default.c.c1, default.c.c2]; line 8 pos 32;
   'Project [*]
   +- 'Filter exists#13 []
      :  +- 'Project [*]
      :     +- 'Filter exists#12 []
      :        :  +- 'Project [*]
      :        :     +- 'Filter ('a1 = c1#6)
      :        :        +- SubqueryAlias `default`.`c`
      :        :           +- Relation[c1#6,c2#7] parquet
      :        +- SubqueryAlias `default`.`b`
      :           +- Relation[b1#4,b2#5] parquet
      +- SubqueryAlias `default`.`a`
         +- Relation[a1#2,a2#3] parquet
   ```
   Rewriting the `IN` predicate in the first query to `EXISTS` would produce exactly the nested-`EXISTS` shape of the second query, so I don't think we can use this approach in this PR.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
