[
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802355#comment-17802355
]
Maytas Monsereenusorn commented on SPARK-30421:
-----------------------------------------------
Not sure if this is the right place to ask, but it seems there are
certain cases where the column is not available for filtering.
This is also a regression from 2.1 to later versions.
Example query: SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2'
This works fine in 2.1 for the reasons mentioned in this thread (due to
_ResolveMissingReferences_).
However, after 2.1 the plan changed and a SubqueryAlias was added. This seems to
prevent ResolveMissingReferences from rewriting the Project to add
the Y column reference.
* Post-2.1 Spark (e.g. Spark 3.3):
{code:java}
spark-sql-3.3> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y))
WHERE Y = '2';
Error in query: Column 'Y' does not exist. Did you mean one of the
following? [__auto_generated_subquery_name.x]; line 1 pos 60;
'Project [*]
+- 'Filter ('Y = 2)
+- SubqueryAlias __auto_generated_subquery_name
+- Project [x#30]
+- SubqueryAlias __auto_generated_subquery_name
+- Project [1 AS x#30, 2 AS Y#31]
+- OneRowRelation{code}
* Spark 2.1:
{code:java}
spark-sql> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
1
Time taken: 2.725 seconds, Fetched 1 row(s)
spark-sql> EXPLAIN EXTENDED SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS
Y)) WHERE Y = '2';
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('Y = 2)
+- 'Project ['x]
+- Project [1 AS x#4, 2 AS Y#5]
   +- OneRowRelation$

== Analyzed Logical Plan ==
x: int
Project [x#4]
+- Project [x#4]
+- Filter (cast(Y#5 as bigint) = cast(2 as bigint))
+- Project [x#4, Y#5]
+- Project [1 AS x#4, 2 AS Y#5]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [1 AS x#4]
+- OneRowRelation$

== Physical Plan ==
*Project [1 AS x#4]
+- Scan OneRowRelation[]
Time taken: 0.813 seconds, Fetched 1 row(s)
{code}
- Do we care that this is a regression, i.e. that the query worked in 2.1 but
breaks in later versions?
- Do we care that filtering on a column that was dropped but exists in the
original table only works in some cases and not others (e.g. not when a
SubqueryAlias is present)?
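For anyone hitting the 3.x error above, a possible workaround (just a sketch, assuming the inner query can be edited) is to project Y explicitly in the subquery so the outer filter can resolve it without relying on ResolveMissingReferences:
{code:java}
-- Y is projected explicitly, so the filter resolves normally
SELECT x FROM (SELECT x, Y FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
{code}
This sidesteps the question of whether the analyzer should reach through the SubqueryAlias, since the column is in scope the ordinary way.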
> Dropped columns still available for filtering
> ---------------------------------------------
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Tobias Hermann
> Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given
> input columns: [foo];
> {quote}
> However, it does not; instead the query runs without error, as if the column
> "bar" still existed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)