GitHub user ioana-delaney opened a pull request:
https://github.com/apache/spark/pull/13867
[SPARK-16161][SQL] Ambiguous error message for unsupported correlated predicate subqueries
## What changes were proposed in this pull request?
Subqueries with deep correlation fail with an ambiguous error message.
**Problem repro:**
```scala
Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t3")

sql("select c1 from t1 where c1 IN (select t2.c1 from t2 where t2.c2 IN (select t3.c2 from t3 where t3.c1 = t1.c1))").show()

org.apache.spark.sql.AnalysisException: filter expression 'listquery()' of type array<null> is not a boolean.;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
```
Based on testing, Spark supports one level of correlation in predicate and
scalar subqueries. An example of supported correlation is shown below.
```SQL
select c1 from t1
where c1 IN (select t2.c1 from t2
             where t2.c2 IN (select t3.c2 from t3
                             where t3.c1 = t2.c1))
```
If the query has deeper correlation, as in the first example, where the innermost subquery is correlated to the outermost query block, the ambiguous error message above is issued.
This PR changes the error message to the following one:
```
Correlated column in subquery cannot be resolved: t1.c1; line 5 pos 28

org.apache.spark.sql.AnalysisException: Correlated column in subquery cannot be resolved: t1.c1; line 5 pos 28
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
```
**Problem description:**
Rule ResolveSubqueries resolves subqueries by first invoking the Analyzer on the subquery tree and then attempting to resolve the correlated column references using the outer plans. If the subquery is successfully resolved, the rule completes the rewrite of the subquery. Otherwise, the unresolved plan is returned. Later, CheckAnalysis does not report the original problem, but rather the result of cascading resolution failures.
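To make the cascading failure concrete, here is a toy model in plain Scala. Every name in it is invented for illustration and none of it is Spark's actual Analyzer code; it only mimics the shape of the problem: the correlated column is looked up against a single outer plan, the reference stays unresolved, and the real cause only surfaces later through an unrelated check on the enclosing predicate.
```scala
// Toy model only: these case classes stand in for Catalyst's plans and
// expressions; they are not Spark classes.
sealed trait Expr { def resolved: Boolean }
case class ResolvedCol(name: String) extends Expr { val resolved = true }
case class UnresolvedCol(name: String) extends Expr { val resolved = false }

// Each "plan" is reduced to the set of column names it can resolve.
case class OuterPlan(output: Set[String])

// Old behaviour: the lookup consults only one outer plan at a time.
def resolveAgainst(outer: OuterPlan, ref: UnresolvedCol): Expr =
  if (outer.output.contains(ref.name)) ResolvedCol(ref.name) else ref

// t1.c1 is defined two query blocks up, not in the immediate outer plan (t2).
val outerPlans = Seq(OuterPlan(Set("t2.c1", "t2.c2")), OuterPlan(Set("t1.c1", "t1.c2")))
val correlated = UnresolvedCol("t1.c1")

val attempt = resolveAgainst(outerPlans.head, correlated)
// The reference is still unresolved, yet no error is raised here; the
// failure only shows up later, phrased in terms of the enclosing IN
// predicate ("filter expression 'listquery()' ... is not a boolean").
assert(!attempt.resolved)
```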
**Solution:**
When resolving an UnresolvedAlias in resolveOuterReferences(), use the entire sequence of available outer plans instead of one outer plan at a time. With this design, resolveOuterReferences() resolves every leaf expression it can, or issues an error naming the column if it cannot be resolved against any outer plan. Then, once the correlation is resolved, the subsequent call to execute() resolves any references to previously unresolved correlated columns.
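The sketch below illustrates the proposed lookup direction, again with invented names rather than the real resolveOuterReferences() signature: each outer plan is modelled as the set of column names it can resolve, the search walks the entire sequence, and a column visible in none of the plans fails immediately with a message that names it.
```scala
import scala.util.Try

// Invented stand-in for the lookup, not Spark's API: search every outer
// plan's output and report the offending column if none of them has it.
def resolveOuterColumn(outerOutputs: Seq[Set[String]], column: String): String =
  outerOutputs.collectFirst { case out if out.contains(column) => column }
    .getOrElse(sys.error(s"Correlated column in subquery cannot be resolved: $column"))

val outerOutputs = Seq(Set("t2.c1", "t2.c2"), Set("t1.c1", "t1.c2"))

// t1.c1 lives two query blocks up; consulting the entire sequence finds it.
assert(resolveOuterColumn(outerOutputs, "t1.c1") == "t1.c1")

// A column visible in no outer plan now produces one pointed error instead
// of a cascade of unrelated resolution failures.
assert(Try(resolveOuterColumn(outerOutputs, "t9.c9")).isFailure)
```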
## How was this patch tested?
Added a new unit test to SubquerySuite.
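A minimal sketch of the kind of test this adds, assuming ScalaTest's intercept and Spark's SharedSQLContext test harness. The suite name below is hypothetical (the actual test lives in the existing SubquerySuite), and the asserted message simply mirrors the new error text shown above.
```scala
import org.apache.spark.sql.{AnalysisException, QueryTest}
import org.apache.spark.sql.test.SharedSQLContext

// Hypothetical suite name; the real test belongs in SubquerySuite.
class DeepCorrelationErrorSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("deeply correlated subquery reports the unresolvable outer column") {
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
    Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t3")

    // Analysis is eager, so the exception is raised when the query is built.
    val e = intercept[AnalysisException] {
      sql("select c1 from t1 where c1 IN (select t2.c1 from t2 " +
        "where t2.c2 IN (select t3.c2 from t3 where t3.c1 = t1.c1))")
    }
    assert(e.getMessage.contains("Correlated column in subquery cannot be resolved"))
  }
}
```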
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ioana-delaney/spark fixErrMsg2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13867.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13867
----
commit 4191f8e657bb4ffde2e35b4b6f65e73d6f321354
Author: Ioana Delaney <[email protected]>
Date: 2016-06-23T01:20:13Z
[SPARK-16161] Ambiguous error message for unsupported correlated predicate subqueries
----