[
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-42439:
------------------------------------
Assignee: (was: Apache Spark)
> Job description in v2 FileWrites can have the wrong committer
> -------------------------------------------------------------
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.1
> Reporter: Lorenzo Martini
> Priority: Minor
>
> There is a difference in behavior between v1 and v2 writes in the order of
> operations when configuring the file writer and the committer.
> v1:
> # writer.prepareWrite()
> # committer.setupJob()
> v2:
> # committer.setupJob()
> # writer.prepareWrite()
>
> This is because the `prepareWrite()` call (the one that performs
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])`)
> happens as part of `createWriteJobDescription`, which is a `lazy val` in the
> `toBatch` call and is therefore only evaluated after `committer.setupJob()`
> at the end of `toBatch`.
> This causes issues when setting up the committer, as some configuration may
> still be missing; for example, the aforementioned output format class is not
> yet set, so the committer is set up for a generic write instead of a Parquet
> write.
>
> The fix is very simple: make the `createWriteJobDescription` call non-lazy,
> so it is evaluated before `committer.setupJob()`.
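
The ordering issue described above can be reduced to a small sketch of `lazy val` evaluation order. The names below (`toBatchLazy`, `toBatchEager`, the event list) are hypothetical and stand in for Spark's actual `FileWrite.toBatch` logic; this is only an illustration of the deferred-initialization behavior, not Spark's real API.

```scala
// Minimal sketch (hypothetical names, not Spark's actual classes) of how a
// `lazy val` job description flips the setup order in v2 file writes.
import scala.collection.mutable.ArrayBuffer

object LazyOrderingDemo {
  // Mirrors the buggy v2 flow: the description is a `lazy val`, so
  // "setupJob" runs before "prepareWrite" has configured anything.
  def toBatchLazy(): List[String] = {
    val events = ArrayBuffer.empty[String]
    lazy val description = { events += "prepareWrite"; "desc" } // deferred
    events += "setupJob"          // stands in for committer.setupJob()
    description                   // the lazy val is only forced here
    events.toList
  }

  // Mirrors the fix: the description is evaluated eagerly, so
  // "prepareWrite" runs before "setupJob" as in the v1 flow.
  def toBatchEager(): List[String] = {
    val events = ArrayBuffer.empty[String]
    val description = { events += "prepareWrite"; "desc" }      // eager
    events += "setupJob"
    events.toList
  }

  def main(args: Array[String]): Unit = {
    println(toBatchLazy().mkString(","))  // setupJob,prepareWrite
    println(toBatchEager().mkString(",")) // prepareWrite,setupJob
  }
}
```

Running the demo shows the lazy variant observing `setupJob` first, which is why the committer sees an unconfigured job (no output format class) in the buggy path.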
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]