[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns

Xuzhou Qin (Jira) Fri, 12 Jun 2020 01:45:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xuzhou Qin updated SPARK-31968:
-------------------------------
    Description: 
I recently remarked that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  |    |--col1=1
  |
  |--col1=2
  |    |--col1=2
  |
  |--col1=3
       |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column make sense 
in some real-world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

If this behaviour is unexpected, I will work on a fix. 

  was:
I recently remarked that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  |    |--col1=1
  |
  |--col1=2
  |    |--col1=2
  |
  |--col1=3
       |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column make sense 
in some real world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

If this is an unexpected behaviour, I will working on a fix. 


> write.partitionBy() creates duplicated subdirectory when user give duplicated 
> columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-31968
>                 URL: https://issues.apache.org/jira/browse/SPARK-31968
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Xuzhou Qin
>            Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   |    |--col1=1
>   |
>   |--col1=2
>   |    |--col1=2
>   |
>   |--col1=3
>        |--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns

Reply via email to