[ https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547123#comment-16547123 ]
Liang-Chi Hsieh edited comment on SPARK-24835 at 7/17/18 9:25 PM:
------------------------------------------------------------------

`drop` actually adds a projection on top of the original dataset, so the following query works:

{code:java}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()
{code}

It can be seen as a query (in Scala) like:

{code:java}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
{code}

This is a valid query. The following query, however, can't be run:

{code:java}
df = df.drop('c')
df.where(df['c'] < 6).show()
{code}

Here you are trying to resolve column {{c}} on top of the updated {{df}}, whose output is now {{[a, b]}}, so the column can't be resolved.

was (Author: viirya):

`drop` actually adds a projection on top of the original dataset, so the following query works:

{code}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()
{code}

It can be seen as a query (in Scala) like:

{code}
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
val filter1 = df.select(df("name")).filter(df("id") === 0)
{code}

This is a valid query. The following query, however, can't be run:

{code}
df = df.drop('c')
df.where(df['c'] < 6).show()
{code}

Here you are trying to resolve column {{c}} on top of the updated {{df}}, whose output is now {{[a, b]}}, so the column can't be resolved.

> col function ignores drop
> -------------------------
>
> Key: SPARK-24835
> URL: https://issues.apache.org/jira/browse/SPARK-24835
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5.3
> Reporter: Michael Souder
> Priority: Minor
>
> Not sure if this is a bug or user error, but I've noticed that accessing
> columns with the col function ignores a previous call to drop.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 'c'])
> df.show()
> +---+----+---+
> |  a|   b|  c|
> +---+----+---+
> |  1|   3|  5|
> |  2|null|  7|
> |  0|   3|  2|
> +---+----+---+
> df = df.drop('c')
> # the col function is able to see the 'c' column even though it has been dropped
> df.where(F.col('c') < 6).show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  3|
> |  0|  3|
> +---+---+
> # trying the same with brackets on the data frame fails with the expected error
> df.where(df['c'] < 6).show()
> Py4JJavaError: An error occurred while calling o36909.apply.
> : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" among (a, b);
> {code}
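
The explanation in the comment (a drop is a projection stacked on the original dataset, and bracket access only looks at the current output) can be illustrated with a small pure-Python toy model. This is only a sketch of the idea, not Spark's implementation: the {{Frame}} class, {{ResolutionError}}, and the way {{where}} walks down to the parent frame are all hypothetical names and assumptions made for illustration.

```python
class ResolutionError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException in this toy model."""


class Frame:
    """Toy stand-in for a DataFrame: rows, a visible output, and the frame underneath."""

    def __init__(self, rows, columns, parent=None):
        self.rows = rows        # list of dicts; dropped columns stay in the underlying rows
        self.columns = columns  # the visible output, e.g. ['a', 'b']
        self.parent = parent    # the frame below a projection, if any

    def __getitem__(self, name):
        # Eager lookup, like df['c']: only this frame's own output is searched.
        if name not in self.columns:
            raise ResolutionError(
                'Cannot resolve column name "%s" among (%s)'
                % (name, ", ".join(self.columns)))
        return name

    def drop(self, name):
        # Modeled as a projection on top of the original frame, as the comment describes.
        return Frame(self.rows, [c for c in self.columns if c != name], parent=self)

    def _reachable(self, name):
        # Assumption of this sketch: a filter may use a column that some
        # frame lower in the chain still outputs.
        frame = self
        while frame is not None:
            if name in frame.columns:
                return True
            frame = frame.parent
        return False

    def where(self, name, predicate):
        if not self._reachable(name):
            raise ResolutionError('Cannot resolve column name "%s"' % name)
        # A null value never passes a comparison, as in SQL.
        kept = [r for r in self.rows if r[name] is not None and predicate(r[name])]
        return Frame(kept, self.columns, parent=self)

    def collect(self):
        # Only the visible output is returned.
        return [tuple(r[c] for c in self.columns) for r in self.rows]


df = Frame(
    [{"a": 1, "b": 3, "c": 5}, {"a": 2, "b": None, "c": 7}, {"a": 0, "b": 3, "c": 2}],
    ["a", "b", "c"],
)
df2 = df.drop("c")

# Like df2.where(df['c'] < 6): the column is looked up on the original frame.
result = df2.where(df["c"], lambda v: v < 6).collect()

# Like df = df.drop('c'); df['c']: the lookup runs against the output [a, b].
try:
    df2["c"]
    error = None
except ResolutionError as exc:
    error = str(exc)
```

In this model a bare name, such as the one {{F.col('c')}} produces, would also be found by walking down to the parent frame, which matches the behavior in the report. Whether Spark resolves it through the same path is an assumption of the sketch, not something the comment states.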