[ https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-19790:
---------------------------------
    Labels: bulk-closed  (was: )

> OutputCommitCoordinator should not allow another task to commit after an 
> ExecutorFailure
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-19790
>                 URL: https://issues.apache.org/jira/browse/SPARK-19790
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Imran Rashid
>            Priority: Major
>              Labels: bulk-closed
>
> The OutputCommitCoordinator resets the authorized committer when a task 
> fails:
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no 
> idea what the task's status is.  The task may still be running, and may even 
> successfully commit its output.  By allowing another task to commit its 
> output, there is a chance that multiple tasks commit, which can result in 
> corrupt output.  This would be particularly problematic when commit() is an 
> expensive operation, e.g. moving files on S3.
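>
> To make the race concrete, here is a minimal runnable sketch of a toy 
> coordinator (hypothetical names like ToyCoordinator; this is not Spark's 
> actual API) that mimics the reset-on-failure behavior linked above:
> {code:scala}
> // Toy model of the per-partition commit lock (illustration only; the real
> // class is OutputCommitCoordinator, whose internals differ).
> object ToyCoordinator {
>   final val NoCommitter = -1
>   private var authorizedAttempt: Int = NoCommitter
>
>   // First attempt to ask wins the commit lock; others are refused.
>   def canCommit(attempt: Int): Boolean = synchronized {
>     if (authorizedAttempt == NoCommitter) { authorizedAttempt = attempt; true }
>     else authorizedAttempt == attempt
>   }
>
>   // Current behavior: any failure of the authorized attempt clears the lock,
>   // even an ExecutorFailure where the attempt may in fact still be running.
>   def attemptFailed(attempt: Int): Unit = synchronized {
>     if (authorizedAttempt == attempt) authorizedAttempt = NoCommitter
>   }
>
>   def main(args: Array[String]): Unit = {
>     assert(canCommit(0))   // attempt 0 is authorized and starts its commit
>     attemptFailed(0)       // executor reported lost; the lock is cleared, but
>                            // attempt 0's commit may still be in flight
>     assert(canCommit(1))   // attempt 1 is now authorized as well
>     println("attempts 0 and 1 were both authorized: possible double commit")
>   }
> }
> {code}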
> For other kinds of task failure, we can safely allow another task to commit.  
> But with an ExecutorFailure, it's not clear what the right thing to do is.  
> The only safe thing to do may be to fail the job.
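>
> One way to express that conservative choice (again only a sketch with 
> hypothetical names, not a proposed patch) is to treat executor loss as a 
> distinct failure kind and abort instead of re-authorizing:
> {code:scala}
> // Sketch only: distinguish executor loss from an ordinary task failure.
> sealed trait FailureKind
> case object TaskFailed   extends FailureKind // task definitely dead; reset is safe
> case object ExecutorLost extends FailureKind // status unknown; commit may be in flight
>
> object SafeToyCoordinator {
>   final val NoCommitter = -1
>   private var authorizedAttempt: Int = NoCommitter
>
>   def attemptFailed(attempt: Int, kind: FailureKind): Unit = synchronized {
>     if (authorizedAttempt == attempt) kind match {
>       case TaskFailed =>
>         authorizedAttempt = NoCommitter // let another attempt commit
>       case ExecutorLost =>
>         // We cannot tell whether the lost attempt's commit already happened,
>         // so the only safe option may be to fail the job.
>         throw new IllegalStateException(
>           s"Authorized committer $attempt lost; aborting to avoid a double commit")
>     }
>   }
> }
> {code}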
> This is related to SPARK-19631, and was discovered during discussion on that 
> PR: https://github.com/apache/spark/pull/16959#discussion_r103549134


