[GitHub] [spark] zsxwing commented on a change in pull request #26312: [SPARK-29649][SQL] Stop task set if FileAlreadyExistsException was thrown when writing to output file

GitBox Thu, 15 Oct 2020 21:59:24 -0700


zsxwing commented on a change in pull request #26312:
URL: https://github.com/apache/spark/pull/26312#discussion_r506051720




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -281,6 +281,10 @@ object FileFormatWriter extends Logging {
     } catch {
       case e: FetchFailedException =>
         throw e
+      case f: FileAlreadyExistsException =>

Review comment:
       If SPARK-27194 is not rare, it sounds like a serious bug. Can we focus on 
fixing SPARK-27194 instead? Maybe speed up the review for #29000? If 
SPARK-27194 is resolved, we won't need this hack, right? In addition, it's 
weird that `FileFormatWriter` needs to understand the behavior of 
`SQLHadoopMapReduceCommitProtocol`. It would be great if we could avoid leaking 
the implementation details of a commit protocol into `FileFormatWriter`.
   
   Regarding use cases, I have seen multiple customers hit a recoverable 
`FileAlreadyExistsException` caused by 
https://issues.apache.org/jira/browse/HADOOP-17015 . But they could not upgrade 
their Hadoop version; it's much harder to upgrade Hadoop than Spark. This 
change makes their jobs fail occasionally after upgrading to Spark 3.0 because 
Spark doesn't retry `FileAlreadyExistsException`. And as you said, the 
user cannot change Spark's behavior to retry `FileAlreadyExistsException`. 
Their jobs would have finished, but because Spark didn't retry, they 
wasted hours of work.
   
   Throwing a Spark-specific exception for commit protocol errors cannot resolve 
this, because the issue is in the underlying FileSystem implementation, which 
Spark calls directly.
   
   IMO, we need to choose between two behaviors:
   
   - Retry `FileAlreadyExistsException`: jobs that hit a recoverable error 
succeed, but a job that is going to fail anyway takes longer to fail.
   - Don't retry `FileAlreadyExistsException`: failing jobs fail fast, but jobs 
that would have succeeded on a retry fail as well.
   
   I prefer the first one, as it makes more jobs succeed and the behavior 
is the same as before.
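   
   To make the two options concrete, here is a minimal sketch of the tradeoff. 
It is not Spark's scheduler code: `runWithRetries`, `flakyWrite`, and the limit 
of 4 attempts are hypothetical names and values chosen for illustration, and it 
uses `java.nio.file.FileAlreadyExistsException` as a stand-in for the Hadoop 
exception so that it runs without any dependencies.

```scala
import java.nio.file.FileAlreadyExistsException
import scala.util.{Failure, Success, Try}

object RetryTradeoffSketch {
  // Hypothetical stand-in for a task retry loop; not Spark's actual API.
  // `failFast` decides whether an exception aborts immediately instead of
  // being retried. Returns the final result and the number of attempts made.
  def runWithRetries[T](maxAttempts: Int, failFast: Throwable => Boolean)(
      task: Int => T): (Try[T], Int) = {
    var attempt = 0
    var result: Try[T] = Failure(new IllegalStateException("task never ran"))
    var done = false
    while (!done && attempt < maxAttempts) {
      attempt += 1
      result = Try(task(attempt))
      result match {
        case Success(_)                => done = true
        case Failure(e) if failFast(e) => done = true // abort without retrying
        case Failure(_)                => () // fall through and retry
      }
    }
    (result, attempt)
  }

  // Simulates a write that hits a recoverable FileAlreadyExistsException
  // on the first attempt only (an assumption made for this illustration).
  def flakyWrite(attempt: Int): String =
    if (attempt == 1) throw new FileAlreadyExistsException("part-00000")
    else "committed"

  def main(args: Array[String]): Unit = {
    // Option 1: retry every failure, including FileAlreadyExistsException.
    val (retried, retriedAttempts) = runWithRetries(4, _ => false)(flakyWrite)
    // Option 2: fail fast on FileAlreadyExistsException.
    val (aborted, abortedAttempts) =
      runWithRetries(4, _.isInstanceOf[FileAlreadyExistsException])(flakyWrite)
    println(s"retry: success=${retried.isSuccess}, attempts=$retriedAttempts")
    println(s"fail fast: success=${aborted.isSuccess}, attempts=$abortedAttempts")
  }
}
```

   With the first policy the simulated write recovers on its second attempt; 
with the second it fails after one attempt even though a retry would have 
succeeded. If the error were permanent, the first policy would instead use up 
all of its attempts before failing, which is the cost side of the tradeoff.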



