[
https://issues.apache.org/jira/browse/SPARK-56588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Toth reassigned SPARK-56588:
----------------------------------
Assignee: Peter Toth
> Dynamic partition overwrite (partitionOverwriteMode=DYNAMIC) behaves as
> append for Spark 4.1+ and HDFS
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-56588
> URL: https://issues.apache.org/jira/browse/SPARK-56588
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.2, 3.5.8, 4.2.0, 4.1.2
> Reporter: Kazuyuki Tanimura
> Assignee: Peter Toth
> Priority: Major
> Labels: pull-request-available
>
> spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected
> for Spark 4.1+ and HDFS. Partition overwrites behave as appends, old data is
> preserved alongside new data in the same partition directory, resulting in
> duplicate rows. This affects both the session config path (spark.conf.set())
> and the inline write option path (df.write.option())
> This looks related to https://issues.apache.org/jira/browse/SPARK-54248
> Setting
> —conf
> spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
> —conf
> spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
>
> Solves the problem.
> Spark automatically applies
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter /
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if the confs are
> empty and spark-hadoop-cloud module is available.
> Magic committer itself may not be a problem, but silently breaking
> spark.sql.sources.partitionOverwriteMode=DYNAMIC behavior is not ideal.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]