[ 
https://issues.apache.org/jira/browse/SPARK-56588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-56588.
--------------------------------
    Fix Version/s: 4.2.0
                   3.5.9
                   4.1.2
                   4.0.3
                   5.0.0
       Resolution: Fixed

Issue resolved by pull request 55622
[https://github.com/apache/spark/pull/55622]

> Dynamic partition overwrite (partitionOverwriteMode=DYNAMIC) behaves as 
> append for Spark 4.1+ and HDFS 
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56588
>                 URL: https://issues.apache.org/jira/browse/SPARK-56588
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.0.2, 3.5.8, 4.2.0, 4.1.2
>            Reporter: Kazuyuki Tanimura
>            Assignee: Peter Toth
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0, 3.5.9, 4.1.2, 4.0.3, 5.0.0
>
>
> spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected 
> for Spark 4.1+ and HDFS. Partition overwrites behave as appends: old data is 
> preserved alongside new data in the same partition directory, resulting in 
> duplicate rows. This affects both the session config path (spark.conf.set()) 
> and the inline write option path (df.write.option()).
> This looks related to https://issues.apache.org/jira/browse/SPARK-54248
> Setting 
>  --conf 
> spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
>  --conf 
> spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
>  
> solves the problem.
> Spark automatically applies 
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter / 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if these confs are 
> unset and the spark-hadoop-cloud module is available. 
> The magic committer itself may not be a problem, but silently breaking 
> spark.sql.sources.partitionOverwriteMode=DYNAMIC behavior is not ideal.
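
For affected releases, the two settings quoted in the description can be passed on the spark-submit command line. The following is only a minimal sketch of assembling those arguments: the helper name workaround_conf_args and the script name job.py are hypothetical and do not come from the issue; only the two key/value pairs are taken from the report.

```python
# The two workaround settings quoted in the issue description.
WORKAROUND_CONFS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.parquet.hadoop.ParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
}


def workaround_conf_args(confs=WORKAROUND_CONFS):
    """Flatten the settings into spark-submit style `--conf key=value` arguments.

    Hypothetical helper for illustration; it only builds strings and does not
    start or configure Spark itself.
    """
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # `job.py` is a placeholder application script, not something named in the issue.
    print(" ".join(["spark-submit", *workaround_conf_args(), "job.py"]))
```

Whether the workaround is still needed depends on the Spark version in use; the fix versions are listed at the top of this message.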



--
This message was sent by Atlassian Jira
(v8.20.10#820010)