[
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xuzhou Qin updated SPARK-31968:
-------------------------------
Summary: write.partitionBy() creates duplicate subdirectories when user
provides duplicate columns (was: write.partitionBy() creates duplicate
subdirectories when user provide duplicate columns)
> write.partitionBy() creates duplicate subdirectories when user provides
> duplicate columns
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6
> Reporter: Xuzhou Qin
> Priority: Major
> Fix For: 3.0.1, 3.1.0, 2.4.7
>
>
> I recently noticed that if there are duplicate elements in the argument
> list of write.partitionBy(), then the same partition subdirectory will be
> created multiple times, nested inside itself.
> For example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SaveMode}
> import spark.implicits._
>
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
>
> df.write
>   .partitionBy("col1", "col1") // "col1" is passed twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>
> {code:java}
> output_dir
> |
> |--col1=1
> | |--col1=1
> |
> |--col1=2
> | |--col1=2
> |
> |--col1=3
> |--col1=3{code}
> And the output can then no longer be read back:
>
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException:
> // Found duplicate column(s) in the partition schema: `col1`;{code}
>
> I am not sure whether partitioning a dataframe twice by the same column
> makes sense in any real-world application, but it does cause schema
> inference problems in tools like the AWS Glue crawler.
> Should Spark deduplicate the partition columns, or throw an exception when
> duplicate columns are detected?
> If this behaviour is unexpected, I will work on a fix.
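> In the meantime, a caller-side workaround is to deduplicate the column
> list before passing it to partitionBy() (a sketch only, not something
> Spark does itself; the column names below are from the example above):
>
> {code:java}
> // Hypothetical workaround: drop duplicate partition columns before writing.
> val partitionCols = Seq("col1", "col1")
> val dedupedCols = partitionCols.distinct // Seq("col1")
>
> df.write
>   .partitionBy(dedupedCols: _*)
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}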
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]