[ https://issues.apache.org/jira/browse/SPARK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356850#comment-16356850 ]
Marco Gaido commented on SPARK-23041:
-------------------------------------

Yes, I am unable to reproduce this problem in the master branch.

> Inconsistent `drop`ing of columns in dataframes
> -----------------------------------------------
>
> Key: SPARK-23041
> URL: https://issues.apache.org/jira/browse/SPARK-23041
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Christos Mantas
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> There is a known bug, [SPARK-13493](https://issues.apache.org/jira/browse/SPARK-13493), when reading files in JSON format with case sensitivity. If, e.g., the file contains both "test" and "TEST", Catalyst will complain on some occasions (e.g. when writing to a Parquet file or creating an RDD from the DataFrame) with an error like:
> org.apache.spark.sql.AnalysisException: Reference 'TEST' is ambiguous, could be: TEST#55L, TEST#57L.;
> This bug is not about that error, but about a very peculiar side effect related to it: in cases like the above, dropping the offending columns has no effect.
> It is very easy to replicate. Here is a PySpark snippet illustrating it:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
>
> sc = pyspark.SparkContext('local[*]')
> spark = SparkSession(sc)
>
> fname = '/tmp/test.json'
> with open(fname, "w") as text_file:
>     text_file.write("{\"test\":1, \"cool\": 3}\n{\"TEST\": 2, \"cool\": 4}")
>
> df = spark.read.json(fname)
> df_d = df.drop("test").drop("TEST")
> print(df_d.schema.names)
> df_d.rdd
> {code}
> This will print ['cool'], but it will also raise the aforementioned exception, meaning that Spark still tries to resolve columns that have actually been dropped.
> The same happens when, e.g., you try to save the DataFrame to a Parquet file.
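For readers following along: the ambiguity in the quoted error comes from Catalyst's default case-insensitive name resolution (`spark.sql.caseSensitive=false`), under which both `test` and `TEST` match a reference to `TEST`. The following is a hypothetical, simplified pure-Python sketch of that resolution rule (not Spark's actual implementation; `resolve` and its behavior are illustrative assumptions):

```python
def resolve(name, columns):
    """Sketch of case-insensitive column resolution: a reference matches
    every column whose name differs only by case."""
    matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) > 1:
        # Mirrors the shape of Spark's AnalysisException message.
        raise ValueError(
            "Reference '%s' is ambiguous, could be: %s." % (name, ", ".join(matches)))
    return matches[0] if matches else None

columns = ["test", "TEST", "cool"]
print(resolve("cool", columns))  # unique match: 'cool'
try:
    resolve("TEST", columns)     # both 'test' and 'TEST' match
except ValueError as e:
    print(e)
```

Under this model, any operation that forces the analyzer to re-resolve `TEST` against a schema still containing both spellings fails, which is consistent with the drop-then-materialize behavior reported above.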
-- This message was sent by Atlassian JIRA (v7.6.3#76005)