[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381479#comment-16381479
 ] 

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~jgrier], what if we introduced a "TM tag / label" mechanism (similar to 
YARN node labels) for standalone clusters to mark TaskManagers that are in 
different states? For example, "disk space insufficient", "network 
congestion", and so on. The task scheduler would check for critical tags and 
avoid those TaskManagers, reducing the risk of task failure. We could also 
report the tags as metrics and show them in the web interface so that DevOps 
can monitor their nodes.
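
To make the idea concrete, here is a rough sketch of what such a tag registry 
could look like. All names (TmTag, TmTagRegistry, schedulable, ...) are 
hypothetical and not part of any Flink API, and the choice of which tags count 
as "critical" is an assumption left to the operator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical tags; the two values mirror the examples given above.
enum TmTag {
    DISK_SPACE_INSUFFICIENT,
    NETWORK_CONGESTION
}

// Hypothetical registry that a scheduler could consult before placing tasks.
final class TmTagRegistry {
    // TaskManager id -> tags currently attached to it (insertion-ordered).
    private final Map<String, Set<TmTag>> tagsByTm = new LinkedHashMap<>();
    // Tags that should exclude a TaskManager from scheduling (assumption:
    // configured by the operator).
    private final Set<TmTag> criticalTags;

    TmTagRegistry(Set<TmTag> criticalTags) {
        this.criticalTags = new HashSet<>(criticalTags);
    }

    void register(String tmId) {
        tagsByTm.computeIfAbsent(tmId, id -> EnumSet.noneOf(TmTag.class));
    }

    void addTag(String tmId, TmTag tag) {
        tagsByTm.computeIfAbsent(tmId, id -> EnumSet.noneOf(TmTag.class)).add(tag);
    }

    void removeTag(String tmId, TmTag tag) {
        Set<TmTag> tags = tagsByTm.get(tmId);
        if (tags != null) {
            tags.remove(tag);
        }
    }

    // Read-only view of a TaskManager's tags, e.g. for metrics or a web UI.
    Set<TmTag> tagsOf(String tmId) {
        Set<TmTag> tags = tagsByTm.get(tmId);
        return tags == null
                ? Collections.<TmTag>emptySet()
                : Collections.unmodifiableSet(tags);
    }

    // TaskManagers that carry no critical tag and may still receive tasks.
    List<String> schedulable() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<TmTag>> entry : tagsByTm.entrySet()) {
            if (Collections.disjoint(entry.getValue(), criticalTags)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In this sketch a TaskManager tagged DISK_SPACE_INSUFFICIENT would be skipped 
only if that tag is in the critical set. Non-critical tags stay visible via 
tagsOf() but do not affect placement.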

We are considering this feature for our internal Flink version at Tencent.

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
>
> There are cases where jobs can get into a state where no progress can be made 
> if there is something pathologically wrong with one of the TaskManager nodes 
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk 
> space.  Flink never considers the TM to be "bad" and will keep using it to 
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will 
> commit suicide if that TM was the source of an exception that caused a job to 
> fail/restart.
> I'm sure there are plenty of other approaches to solving this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
