[
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xuzhou Qin updated SPARK-31968:
-------------------------------
Issue Type: Bug (was: Improvement)
> write.partitionBy() cannot handle duplicated column
> ---------------------------------------------------
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Reporter: Xuzhou Qin
> Priority: Major
>
> I recently noticed that if there are duplicated elements in the argument of
> write.partitionBy(), then the same partition subdirectory is created
> multiple times.
> For example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SaveMode}
> import spark.implicits._
>
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
>
> df.write
>   .partitionBy("col1", "col1") // we pass "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>
> {code:java}
> output_dir
> |
> |--col1=1
> | |--col1=1
> |
> |--col1=2
> | |--col1=2
> |
> |--col1=3
> |--col1=3{code}
> And the output can then no longer be read back:
>
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found
> duplicate column(s) in the partition schema: `col1`;{code}
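>
> As a caller-side workaround (a minimal sketch, assuming the partition columns are
> collected in a Seq; the name partitionCols is illustrative), the duplicates can be
> dropped before calling partitionBy:
>
> {code:java}
> // Workaround sketch: deduplicate the partition column names before writing
> val partitionCols = Seq("col1", "col1")      // may contain duplicates
> df.write
>   .partitionBy(partitionCols.distinct: _*)   // keep each column name only once
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}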
>
> I am not sure whether partitioning a DataFrame twice by the same column makes
> sense in any real-world application, but it causes schema inference problems
> in tools like the AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe
> throw an exception when duplicated columns are detected?
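> Until that is decided, a fail-fast check on the caller side could look like the
> sketch below (purely illustrative; requireDistinctPartitionCols is not part of any
> Spark API):
>
> {code:java}
> // Illustrative guard: reject duplicate partition column names up front
> def requireDistinctPartitionCols(cols: Seq[String]): Unit = {
>   val dups = cols.groupBy(identity).collect { case (c, occ) if occ.size > 1 => c }
>   require(dups.isEmpty, s"Duplicate partition column(s): ${dups.mkString(", ")}")
> }
>
> requireDistinctPartitionCols(Seq("col1", "col1")) // throws IllegalArgumentException{code}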
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]