GitHub user ioana-delaney opened a pull request:

    https://github.com/apache/spark/pull/13867

    [SPARK-16161][SQL] Ambiguous error message for unsupported correlated 
predicate subqueries

    ## What changes were proposed in this pull request?
    Subqueries with deep correlation fail with an ambiguous error message.
    
    **Problem repro:**
    ```SQL
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t3")
    
    sql("select c1 from t1 where c1 IN (select t2.c1 from t2 where t2.c2 IN 
(select t3.c2 from t3 where t3.c1 = t1.c1))").show()
    
    org.apache.spark.sql.AnalysisException: filter expression 'listquery()' of 
type array<null> is not a boolean.;
      at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
      at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
    ```
    
    Based on testing, Spark supports one level of correlation in predicate and 
scalar subqueries. An example of supported correlation is shown below.
    
    ```SQL
    select c1 from t1
    where c1 IN (select t2.c1 from t2 where t2.c2 IN (select t3.c2 from t3 
where t3.c1 = t2.c1))
    ```
    
    If the query has deep correlation, as in the first example, where the 
inner subquery is correlated to the outermost query block, the error message 
above is issued.
    
    This PR changes the error message to the following one:
    ```SQL
    Correlated column in subquery cannot be resolved: t1.c1; line 5 pos 28
    org.apache.spark.sql.AnalysisException: Correlated column in subquery 
cannot be resolved: t1.c1; line 5 pos 28
      at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
    ```
    
    **Problem description:**
    Rule ResolveSubqueries resolves subqueries by first invoking the Analyzer 
on the subquery tree and then attempting to resolve the correlated column 
references using the outer plans. If the subquery was successfully resolved, the 
rule completes rewriting the subquery. Otherwise, the unresolved plan is 
returned. Later, CheckAnalysis reports not the original problem but the 
result of cascading resolution failures.
    
    **Solution:**
    When resolving an UnresolvedAlias in resolveOuterReferences(), use the 
entire sequence of available outer plans, instead of one outer plan at a time. 
With this design, resolveOuterReferences() resolves any leaf expressions, or 
issues an error if the column cannot be resolved. Then, if the correlation is 
resolved, the subsequent call to execute() will resolve any references to 
previously unresolved correlated columns.
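    
    The idea can be illustrated with a small standalone sketch. This is a toy 
model only, not the code in this patch: `Scope`, `OuterRefSketch` and 
`resolveOuter` are hypothetical names, and a scope is reduced to a table name 
plus a set of column names. It shows a lookup that searches every outer scope 
it is given and, on failure, names the offending column instead of leaving an 
unresolved reference for a later check to trip over.
    
    ```scala
    // Toy model: none of these names are Spark APIs.
    object OuterRefSketch {
      // A query block that exposes one table and its column names.
      final case class Scope(name: String, columns: Set[String])

      // Look up a qualified column such as "t2.c1" in all outer scopes at once.
      // Returns Right(scopeName) when found, otherwise Left(message) naming
      // the column that could not be resolved.
      def resolveOuter(column: String, outerScopes: Seq[Scope]): Either[String, String] = {
        val notResolved = s"Correlated column in subquery cannot be resolved: $column"
        column.split('.') match {
          case Array(table, col) =>
            outerScopes
              .find(s => s.name == table && s.columns.contains(col))
              .map(_.name)
              .toRight(notResolved)
          case _ =>
            // Unqualified or malformed names are treated as unresolved here.
            Left(notResolved)
        }
      }
    }
    ```
    
    Assuming only t2 is visible as an outer scope of the innermost subquery, 
`resolveOuter("t2.c1", ...)` succeeds, while `resolveOuter("t1.c1", ...)` 
returns the message naming t1.c1.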
    
    ## How was this patch tested?
    Added a new unit test to SubquerySuite.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ioana-delaney/spark fixErrMsg2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13867.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13867
    
----
commit 4191f8e657bb4ffde2e35b4b6f65e73d6f321354
Author: Ioana Delaney <[email protected]>
Date:   2016-06-23T01:20:13Z

    [SPARK-16161] Ambiguous error message for unsupported correlated predicate 
subqueries

----

