[ https://issues.apache.org/jira/browse/SPARK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356850#comment-16356850 ]
Marco Gaido commented on SPARK-23041:
-------------------------------------

Yes, I am unable to reproduce this problem in the master branch.

> Inconsistent `drop`ing of columns in dataframes
> -----------------------------------------------
>
> Key: SPARK-23041
> URL: https://issues.apache.org/jira/browse/SPARK-23041
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Christos Mantas
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> There is a known bug, [SPARK-13493](https://issues.apache.org/jira/browse/SPARK-13493), when reading files in JSON format with case sensitivity. If, e.g., the file contains both "test" and "TEST", Catalyst will complain on some occasions (e.g. when writing to a Parquet file or creating an RDD from the DataFrame) with an error like:
> org.apache.spark.sql.AnalysisException: Reference 'TEST' is ambiguous, could be: TEST#55L, TEST#57L.;
> This bug is not about that error, but about a very peculiar side effect related to it: in cases like the above, dropping the offending columns has no effect.
> It is very easy to replicate. Here is a PySpark snippet illustrating it:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
>
> sc = pyspark.SparkContext('local[*]')
> spark = SparkSession(sc)
>
> fname = '/tmp/test.json'
> with open(fname, "w") as text_file:
>     text_file.write("{\"test\":1, \"cool\": 3}\n{\"TEST\": 2, \"cool\": 4}")
>
> df = spark.read.json(fname)
> df_d = df.drop("test").drop("TEST")
> print(df_d.schema.names)
> df_d.rdd
> {code}
> This will print ['cool'], but it will also raise the aforementioned exception, meaning that Spark still tries to resolve columns that have actually been dropped.
> The same happens when, e.g., you try to save the DataFrame to a Parquet file.
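For readers following along: the ambiguity in the quoted error comes from Catalyst's default case-insensitive name resolution (`spark.sql.caseSensitive=false`), under which both `test` and `TEST` match a reference to `TEST`. The following is a hypothetical, simplified pure-Python sketch of that resolution rule (not Spark's actual implementation; `resolve` and its behavior are illustrative assumptions):

```python
def resolve(name, columns):
    """Sketch of case-insensitive column resolution: a reference matches
    every column whose name differs only by case."""
    matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) > 1:
        # Mirrors the shape of Spark's AnalysisException message.
        raise ValueError(
            "Reference '%s' is ambiguous, could be: %s." % (name, ", ".join(matches)))
    return matches[0] if matches else None

columns = ["test", "TEST", "cool"]
print(resolve("cool", columns))  # unique match: 'cool'
try:
    resolve("TEST", columns)     # both 'test' and 'TEST' match
except ValueError as e:
    print(e)
```

Under this model, any operation that forces the analyzer to re-resolve `TEST` against a schema still containing both spellings fails, which is consistent with the drop-then-materialize behavior reported above.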
-- This message was sent by Atlassian JIRA (v7.6.3#76005)