[
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802355#comment-17802355
]
Maytas Monsereenusorn commented on SPARK-30421:
-----------------------------------------------
Not sure if this is the right place to ask, but it seems there are
certain cases where the column is not available for filtering.
This is also a regression from 2.1 to later versions.
Example query: SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2'
This works fine in 2.1 for the reasons mentioned in this thread (due to
_ResolveMissingReferences_).
However, after 2.1 the plan changed and a SubqueryAlias was added. This seems to
prevent ResolveMissingReferences from rewriting the Project to add
the Y column reference.
* Post-2.1 Spark (e.g. Spark 3.3):
{code:java}
spark-sql-3.3> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y))
WHERE Y = '2';
Error in query: Column 'Y' does not exist. Did you mean one of the
following? [__auto_generated_subquery_name.x]; line 1 pos 60;
'Project [*]
+- 'Filter ('Y = 2)
+- SubqueryAlias __auto_generated_subquery_name
+- Project [x#30]
+- SubqueryAlias __auto_generated_subquery_name
+- Project [1 AS x#30, 2 AS Y#31]
+- OneRowRelation{code}
* Spark 2.1:
{code:java}
spark-sql> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
1
Time taken: 2.725 seconds, Fetched 1 row(s)
spark-sql> EXPLAIN EXTENDED SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS
Y)) WHERE Y = '2';
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('Y = 2)
+- 'Project ['x]
+- Project [1 AS x#4, 2 AS Y#5]
   +- OneRowRelation$

== Analyzed Logical Plan ==
x: int
Project [x#4]
+- Project [x#4]
+- Filter (cast(Y#5 as bigint) = cast(2 as bigint))
+- Project [x#4, Y#5]
+- Project [1 AS x#4, 2 AS Y#5]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [1 AS x#4]
+- OneRowRelation$

== Physical Plan ==
*Project [1 AS x#4]
+- Scan OneRowRelation[]
Time taken: 0.813 seconds, Fetched 1 row(s)
{code}
- Do we care that this is a regression, i.e. that the query worked in 2.1 but
breaks in later versions?
- Do we care that filtering on a column that was dropped but exists in the
original table only works in some cases and not others (e.g. not when a
SubqueryAlias is present)?
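For anyone hitting the 3.x error above, a possible workaround (just a sketch, assuming the inner query can be edited) is to project Y explicitly in the subquery so the outer filter can resolve it without relying on ResolveMissingReferences:
{code:java}
-- Y is projected explicitly, so the filter resolves normally
SELECT x FROM (SELECT x, Y FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
{code}
This sidesteps the question of whether the analyzer should reach through the SubqueryAlias, since the column is in scope the ordinary way.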
> Dropped columns still available for filtering
> ---------------------------------------------
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Tobias Hermann
> Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given
> input columns: [foo];
> {quote}
> However, it does not; instead the query runs without error, as if the column
> "bar" still existed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)