[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381763#comment-16381763
 ] 

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~till.rohrmann], what is your opinion on this idea? In Flink 1.5 and 
later, the snapshots produced by the local recovery feature may also 
frequently exhaust disk space. If we collect the TaskManagers' metrics and 
evaluate them against a set of rules, the ResourceManager can mark the 
offending TaskManagers as 'dangerous', and the scheduler can then avoid them.
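
To make the idea more concrete, here is a rough sketch of the rule-based 
marking, written in Python only for brevity. Every name in it 
(TaskManagerMetrics, low_disk, too_many_failures, dangerous_task_managers, 
schedulable) and both thresholds are hypothetical; none of them are existing 
Flink APIs or configuration options.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class TaskManagerMetrics:
    """Hypothetical snapshot of the metrics reported by one TaskManager."""
    tm_id: str
    free_disk_bytes: int
    recent_task_failures: int


# A rule returns True when the metrics indicate an unhealthy TaskManager.
Rule = Callable[[TaskManagerMetrics], bool]


def low_disk(min_free_bytes: int) -> Rule:
    """Flag a TaskManager whose free disk space is below the threshold."""
    return lambda m: m.free_disk_bytes < min_free_bytes


def too_many_failures(max_failures: int) -> Rule:
    """Flag a TaskManager that caused more than max_failures task failures."""
    return lambda m: m.recent_task_failures > max_failures


def dangerous_task_managers(
    metrics: Iterable[TaskManagerMetrics], rules: List[Rule]
) -> Set[str]:
    """Return the ids of all TaskManagers that match at least one rule."""
    return {m.tm_id for m in metrics if any(rule(m) for rule in rules)}


def schedulable(
    metrics: List[TaskManagerMetrics], rules: List[Rule]
) -> List[str]:
    """Return the ids the scheduler may still use, in their original order."""
    flagged = dangerous_task_managers(metrics, rules)
    return [m.tm_id for m in metrics if m.tm_id not in flagged]
```

For example, with the rules low_disk(1 GiB) and too_many_failures(3), a 
TaskManager reporting 100 MiB of free disk would be marked 'dangerous' and 
skipped, while the healthy ones would stay available to the scheduler.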

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
>
> There are cases where jobs can get into a state where no progress can be made 
> if there is something pathologically wrong with one of the TaskManager nodes 
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk 
> space.  Flink never considers the TM to be "bad" and will keep using it to 
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will 
> commit suicide if that TM was the source of an exception that caused a job to 
> fail/restart.
> I'm sure there are plenty of other approaches to solving this..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
