Till Rohrmann commented on FLINK-5621:

Hi [~yanghua], I think such a feature would indeed be a nice addition for 
Flink. Black-listing TMs with known issues could be done in the 
{{ResourceManager}}. We could also add an RPC call which tells the {{TMs}} to 
shut down in such a case.
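
To make the idea concrete, here is a minimal sketch of the bookkeeping such a
blacklist might need. Everything in it is hypothetical: the class name
TaskManagerBlacklist, the failure threshold and the method names are
assumptions for illustration, not existing Flink APIs. The assumption is that
the {{ResourceManager}} would check isBlacklisted() before handing out slots
and could send the shutdown RPC when reportFailure() returns true.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Flink): counts failures per TaskManager. */
public class TaskManagerBlacklist {

    // Assumed policy: a TM is blacklisted once it has caused this many failures.
    private final int maxFailures;
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    public TaskManagerBlacklist(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /**
     * Records one failure attributed to the given TaskManager.
     *
     * @return true only if this call newly blacklisted the TaskManager,
     *         i.e. the point at which a shutdown RPC could be sent
     */
    public boolean reportFailure(String taskManagerId) {
        int count = failureCounts.merge(taskManagerId, 1, Integer::sum);
        if (count >= maxFailures) {
            // Set.add returns false if the TM was already blacklisted.
            return blacklisted.add(taskManagerId);
        }
        return false;
    }

    /** True if no new tasks should be scheduled on this TaskManager. */
    public boolean isBlacklisted(String taskManagerId) {
        return blacklisted.contains(taskManagerId);
    }

    /** Forgets a TaskManager, e.g. after it was shut down or repaired. */
    public void remove(String taskManagerId) {
        failureCounts.remove(taskManagerId);
        blacklisted.remove(taskManagerId);
    }
}
```

A threshold greater than one would avoid blacklisting a TM for a single
failure that was really caused by user code; how to attribute a failure to a
TM in the first place is left open here.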

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> ----------------------------------------------------------------------------------------------------
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
> There are cases where jobs can get into a state where no progress can be made 
> if there is something pathologically wrong with one of the TaskManager nodes 
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk 
> space.  Flink never considers the TM to be "bad" and will keep using it to 
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will 
> commit suicide if that TM was the source of an exception that caused a job to 
> fail/restart.
> I'm sure there are plenty of other approaches to solving this.

This message was sent by Atlassian JIRA
