[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Scruggs updated SPARK-18065:
------------------------------------

Description:

I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation.

In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
{code}

However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds (no AnalysisException) and seems to filter out data as if the column remains:

{code:title=Spark 2.0.1}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}

was:

I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation.

In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
{code}

However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems to filter out data as if the column remains:

{code:title=Spark 2.0.1}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}


> Spark 2 allows filter/where on columns not in current schema
> ------------------------------------------------------------
>
>                 Key: SPARK-18065
>                 URL: https://issues.apache.org/jira/browse/SPARK-18065
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Matthew Scruggs
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
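Not part of the original report: until the Spark 2 behavior is clarified, a caller who wants the Spark 1.6 fail-fast semantics back can check that a filter's referenced columns still exist in the DataFrame's schema before applying it. The sketch below is a minimal, Spark-free illustration of that membership check; the `SchemaGuard` object, the `missingColumns` helper, and the hard-coded column sets are all illustrative assumptions (in real code the referenced names would come from parsing the filter condition, and the schema from `df.columns`).

```scala
// Hedged sketch: detect filter conditions that reference columns absent
// from the current schema, mimicking the 1.6-era AnalysisException check.
object SchemaGuard {
  // Names referenced by a condition but missing from the schema.
  def missingColumns(schemaCols: Set[String], referenced: Set[String]): Set[String] =
    referenced.diff(schemaCols)

  def main(args: Array[String]): Unit = {
    val schemaAfterSelect = Set("id")    // df2's schema after select("id")
    val referencedByWhere = Set("word")  // column referenced by "word = 'one'"
    val missing = missingColumns(schemaAfterSelect, referencedByWhere)
    if (missing.nonEmpty)
      println(s"cannot resolve ${missing.mkString(", ")} given input columns: [${schemaAfterSelect.mkString(", ")}]")
  }
}
```

Running `main` prints `cannot resolve word given input columns: [id]`, matching the shape of the 1.6.2 error above; a caller could throw instead of printing to restore the old fail-fast behavior.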