Xuzhou Qin created SPARK-31968:
----------------------------------
Summary: write.partitionBy() cannot handle duplicated columns
Key: SPARK-31968
URL: https://issues.apache.org/jira/browse/SPARK-31968
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.5
Reporter: Xuzhou Qin
I recently noticed that if there are duplicated elements in the argument of
write.partitionBy(), the same partition subdirectory will be created
multiple times.
For example:
{code:java}
import org.apache.spark.sql.{DataFrame, SaveMode}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1") // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:
{code:java}
output_dir
|
|--col1=1
|  |--col1=1
|
|--col1=2
|  |--col1=2
|
|--col1=3
|  |--col1=3{code}
And we won't be able to read the output back:
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException:
// Found duplicate column(s) in the partition schema: `col1`;{code}
I am not sure whether partitioning a dataframe twice by the same column makes
sense in any real-world application, but it causes schema inference problems in
tools like the AWS Glue crawler.
Should Spark deduplicate the partition columns, or throw an exception when
duplicated columns are detected?
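In the meantime, a caller can avoid the problem by deduplicating the column
list before passing it to partitionBy(). A minimal user-side sketch (the
partitionCols name is illustrative, not part of any Spark API):
{code:java}
// Deduplicate the partition column list before writing; Seq.distinct
// preserves the first occurrence of each column name.
val partitionCols = Seq("col1", "col1").distinct // Seq("col1")

df.write
  .partitionBy(partitionCols: _*) // expand the deduplicated list as varargs
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}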