GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/21745
[SPARK-24781][SQL] Using a reference from Dataset in Filter/Sort might not
work
## What changes were proposed in this pull request?
When a filter or sort uses a reference taken from a Dataset, and that
reference was not included in the prior select, an AnalysisException occurs.
For example:
```scala
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
```
```
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing
from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
+- AnalysisBarrier
      +- Project [name#5]
         +- Project [_1#2 AS name#5, _2#3 AS id#6]
            +- LocalRelation [_1#2, _2#3]
```
This change adds the condition `missingInput.isEmpty` to `resolved` in
`LogicalPlan`. Previously, a logical plan was considered resolved if all of its
expressions were resolved and its children were resolved. However, because an
already-resolved reference such as `df("name")` can be added to a query plan,
it is possible for all expressions in the plan to be resolved while some of
their inputs are still missing from the child output.
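To illustrate the idea, here is a minimal, self-contained sketch of how such a check behaves on the failing plan above. It is a toy model only: `Attr`, `Plan`, `Relation`, `Project`, `Filter`, `resolvedOld` and `resolvedNew` are hypothetical names invented for this example and are not Spark's Catalyst classes or the code in this patch.

```scala
object MissingInputSketch {
  // Toy stand-in for a resolved attribute such as name#5 or id#6.
  final case class Attr(name: String, id: Int)

  sealed trait Plan {
    def children: Seq[Plan]
    def output: Seq[Attr]      // attributes this node produces
    def references: Set[Attr]  // attributes this node's expressions use

    // Attributes referenced by this node that none of its children produce.
    def missingInput: Set[Attr] = references -- children.flatMap(_.output)

    // Old-style check: only children need to be resolved. The toy model holds
    // nothing but already-resolved Attr values, so expressions always pass.
    def resolvedOld: Boolean = children.forall(_.resolvedOld)

    // Check with the additional condition described above.
    def resolvedNew: Boolean =
      children.forall(_.resolvedNew) && missingInput.isEmpty
  }

  final case class Relation(output: Seq[Attr]) extends Plan {
    val children: Seq[Plan] = Nil
    val references: Set[Attr] = Set.empty
  }

  final case class Project(projectList: Seq[Attr], child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
    val output: Seq[Attr] = projectList
    val references: Set[Attr] = projectList.toSet
  }

  final case class Filter(condition: Set[Attr], child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
    val output: Seq[Attr] = child.output
    val references: Set[Attr] = condition
  }

  val name: Attr = Attr("name", 5)
  val id: Attr = Attr("id", 6)

  // Mirrors df.select(df("name")).filter(df("id") === 0):
  // the filter references id#6, but the projection only outputs name#5.
  val plan: Plan = Filter(Set(id), Project(Seq(name), Relation(Seq(name, id))))

  def main(args: Array[String]): Unit = {
    println(s"old check: ${plan.resolvedOld}")    // true
    println(s"new check: ${plan.resolvedNew}")    // false
    println(s"missing:   ${plan.missingInput}")   // Set(Attr(id,6))
  }
}
```

In this sketch the old check reports the plan as resolved even though `id#6` is not available from the projection, while the stricter check does not.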
## How was this patch tested?
Added tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-24781
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21745.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21745
----
commit 97837a46b790ceb1f0df38cc7a3094b1cb4eb556
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-07-11T07:44:43Z
Resolved references from Dataset should be checked if it is missed from
plan.
----