zsxwing commented on a change in pull request #26312:
URL: https://github.com/apache/spark/pull/26312#discussion_r506051720
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -281,6 +281,10 @@ object FileFormatWriter extends Logging {
} catch {
case e: FetchFailedException =>
throw e
+ case f: FileAlreadyExistsException =>
Review comment:
If SPARK-27194 is not rare, it sounds like a serious bug. Can we focus on
fixing SPARK-27194 instead? Maybe speed up the review for #29000? If
SPARK-27194 is resolved, we won't need this hack. Right? In addition, it's
weird that FileFormatWriter needs to understand the behavior of
`SQLHadoopMapReduceCommitProtocol`. It would be great if we can avoid leaking
the implementation details of a commit protocol to `FileFormatWriter`.
Regarding use cases, I have seen multiple customers hitting recoverable
`FileAlreadyExistsException` caused by
https://issues.apache.org/jira/browse/HADOOP-17015 . But they could not upgrade
their Hadoop version. It's much harder to upgrade Hadoop than Spark. This
change makes their jobs fail occasionally after upgrading to Spark 3.0 because
Spark doesn't retry `FileAlreadyExistsException`. And as you said, the
user cannot change Spark's behavior to retry `FileAlreadyExistsException`.
Their jobs would have finished, but because Spark didn't retry, they
wasted hours of work.
Throwing a Spark-specific exception for commit protocol errors cannot resolve
this because the issue is in the underlying FileSystem implementation called by
Spark directly.
IMO, we need to choose between:
- Retrying `FileAlreadyExistsException`, so a job hit by a recoverable
error succeeds, but a job that is going to fail takes longer to fail.
- Not retrying `FileAlreadyExistsException`, so a failing job fails fast,
but a job that would have succeeded on retry fails.
I prefer the first one as we can make more jobs successful and the behavior
is the same as before.
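
To make the tradeoff concrete, here is a minimal sketch of the two policies. It is a toy model under stated assumptions, not Spark's scheduler or the code in this PR: `RetryTradeoffSketch`, `runTask`, `maxAttempts`, and `retryOnFileExists` are hypothetical names, and `java.nio.file.FileAlreadyExistsException` stands in for the Hadoop exception so the snippet has no dependencies.

```scala
import java.nio.file.FileAlreadyExistsException
import scala.util.{Failure, Success, Try}

// Hypothetical model of a task-retry loop; names are illustrative only.
object RetryTradeoffSketch {

  /**
   * Runs `attempt` up to `maxAttempts` times, passing the 1-based attempt number.
   * When `retryOnFileExists` is false, a FileAlreadyExistsException ends the
   * task immediately (fail fast); otherwise it is retried like any other error.
   * Returns the last outcome and the number of attempts used.
   */
  def runTask[T](maxAttempts: Int, retryOnFileExists: Boolean)(attempt: Int => T): (Try[T], Int) = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")
    var used = 0
    var last: Try[T] = Failure(new IllegalStateException("task never ran"))
    var done = false
    while (!done && used < maxAttempts) {
      used += 1
      last = Try(attempt(used))
      last match {
        case Success(_) => done = true
        case Failure(_: FileAlreadyExistsException) if !retryOnFileExists => done = true
        case Failure(_) => () // any other failure: fall through and retry
      }
    }
    (last, used)
  }
}
```

With a write that collides only on its first attempt (the recoverable case above), `runTask(4, retryOnFileExists = true)` succeeds on the second attempt, while `retryOnFileExists = false` fails after one. With a write that always collides, the first policy burns all four attempts before failing and the second fails immediately. That is the cost on each side of the choice.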