[
https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261252#comment-15261252
]
Jason Moore commented on SPARK-14915:
-------------------------------------
That's exactly my current thinking too. But even if we keep allowing some
tasks to be retried without limit in certain contexts (the two I'm currently
aware of are: commit denied on speculative tasks, or an executor lost due to a
YARN de-allocation), it does seem that the commit denial often happens when
another copy of the task has already succeeded. I'm about to do some testing
on this now, trying not re-queuing the task in that scenario.
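To make the "don't re-queue" idea concrete, here is a minimal, hypothetical sketch of the decision in question. This is not Spark's actual scheduler code; the names (`should_resubmit`, `FailureReason`, `TaskAttempt`) are invented for illustration only:

```python
# Hypothetical sketch: should a failed task attempt be re-queued?
# A commit-denied attempt whose task already has a successful copy is
# dropped, since re-running it would only be denied again.

from dataclasses import dataclass
from enum import Enum, auto


class FailureReason(Enum):
    COMMIT_DENIED = auto()   # driver refused the output commit
    EXECUTOR_LOST = auto()   # e.g. YARN de-allocated the executor
    EXCEPTION = auto()       # ordinary task failure


@dataclass
class TaskAttempt:
    task_id: int
    speculative: bool


def should_resubmit(reason: FailureReason,
                    attempt: TaskAttempt,
                    task_already_succeeded: bool) -> bool:
    """Return True if the failed attempt should be put back in the queue."""
    if reason is FailureReason.COMMIT_DENIED and task_already_succeeded:
        # Another copy won the commit race; there is nothing left to do,
        # so re-queuing would just loop: run -> commit denied -> re-queue.
        return False
    return True


# A speculative copy denied the commit after the original already finished:
print(should_resubmit(FailureReason.COMMIT_DENIED, TaskAttempt(7, True), True))    # False
# An executor lost mid-run, with no successful copy yet:
print(should_resubmit(FailureReason.EXECUTOR_LOST, TaskAttempt(7, False), False))  # True
```

Under this sketch, the unlimited-retry exemption from SPARK-14357 is kept, but the pathological loop described below is broken because the denied attempt is never resubmitted once its task has a winner.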
> Tasks that fail due to CommitDeniedException (a side-effect of speculation)
> can cause job to never complete
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-14915
> URL: https://issues.apache.org/jira/browse/SPARK-14915
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.2
> Reporter: Jason Moore
> Priority: Critical
>
> In SPARK-14357, the code was corrected towards the originally intended
> behavior that a CommitDeniedException should not count towards the failure
> count for a job. After running with this fix for a few weeks, it's become
> apparent that this behavior has an unintended consequence: a speculative
> task can continuously receive a CDE from the driver, causing it to fail and
> retry over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver into
> TaskState.FINISHED, or some other state to indicate that the task shouldn't
> be resubmitted by the TaskScheduler. I'd appreciate some opinions on
> whether doing something like this has other consequences.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]