[
https://issues.apache.org/jira/browse/FLINK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen resolved FLINK-1668.
---------------------------------
Resolution: Implemented
Implemented in abbb0a93ca67da17197dc5372e6d95edd8149d44
> Add a config option to specify delays between restarts
> ------------------------------------------------------
>
> Key: FLINK-1668
> URL: https://issues.apache.org/jira/browse/FLINK-1668
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Fix For: 0.9
>
>
> The system currently introduces a short delay between a failed task execution
> and the restarted execution.
> The reason is that this delay seemed to help in letting problems surface that
> let to the failed task. As an example, if a TaskManager fails, tasks fail due
> to data transfer errors. The TaskManager is not immediately recognized as
> failed, though (takes a bit until heartbeats time out). Immediately
> re-deploying tasks has a very high chance of assigning work to the
> TaskManager that is actually not responding, causing the execution retry to
> fail again. The delay gives the system time to figure out that the
> TaskManager was lost and does not take it into account upon the retry.
> Currently, the system uses the heartbeat timeout as the default delay value.
> This may make sense as a default value for critical task failures, but is
> actually quite high for other types of failures.
> In any case, I would like to add an option for users to specify the delay
> (even set it to 0, if desired).
> The delay is not the best solution, in my opinion, we should eventually move
> to something better. Ideas are to put TaskManagers responsible for failed
> tasks in a "probationary" mode until they have reported back that everything
> is good (still alive, disk space available, etc)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)