[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280870#comment-17280870
 ] 

Piotr Nowojski commented on FLINK-17726:
----------------------------------------

Yes/maybe [~zhuzh]. I think you summarised the gist of the idea correctly. 
However, there is one potential improvement:
{quote}
For "secondary" failures, given that the related "primary" failure should always 
be reported sooner or later, JM can simply mark the task as CANCELED and skip 
the failure handling.
{quote}
Maybe not in the first version, or maybe already in the first version, 
[~trohrmann] would like to tackle this problem to speed up failover, so that we 
do not have to wait for the primary failure to arrive. If the JM already knows 
that some tasks have started to fail (with secondary failures), it can fail over 
the job/region right away, instead of waiting, for example, 1 minute for the 
heartbeats to time out. 

One thing that is not clear to me is how to detect the primary failure in 
such a case. Maybe we would need to fail over the job but still keep collecting 
the failure reasons for the previous attempt, and keep updating the detected 
root cause lazily? For example, if we have a chain of 4 tasks:
A->B->C->D
the TaskManager handling A may fail silently, while the first error message the 
JM receives comes from D, then a second later from C, then a second later from 
B, and 1 minute later a timeout of A.
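
To make the "fail over early, keep collecting, update the root cause lazily" 
idea a bit more concrete, here is a minimal sketch. Everything in it is 
hypothetical: AttemptFailureLog, Kind and onFailure are illustrative names, not 
existing Flink classes. It also simply assumes that the JM can label a report 
as primary or secondary, which is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-attempt failure log; not part of Flink. */
final class AttemptFailureLog {

    /** Assumed classification of a report; how the JM derives it is left open. */
    enum Kind { PRIMARY, SECONDARY }

    /** One failure report received by the JM for this attempt. */
    static final class Report {
        final String task;
        final Kind kind;
        final String reason;

        Report(String task, Kind kind, String reason) {
            this.task = task;
            this.kind = kind;
            this.reason = reason;
        }
    }

    private final List<Report> reports = new ArrayList<>();
    private Report rootCause;          // best guess so far, refined lazily
    private boolean failoverTriggered; // failover starts on the very first report

    /**
     * Records a report for this attempt.
     *
     * @return true if this report is the one that should trigger the failover
     */
    boolean onFailure(String task, Kind kind, String reason) {
        Report report = new Report(task, kind, reason);
        reports.add(report);

        // Keep the first PRIMARY report; until one arrives, fall back to the
        // first report seen, even if it is only a secondary failure.
        if (rootCause == null
                || (rootCause.kind == Kind.SECONDARY && kind == Kind.PRIMARY)) {
            rootCause = report;
        }

        boolean triggersFailover = !failoverTriggered;
        failoverTriggered = true;
        return triggersFailover;
    }

    /** Task currently believed to be the root cause, or null if nothing was reported. */
    String rootCauseTask() {
        return rootCause == null ? null : rootCause.task;
    }

    /** Number of reports collected for this attempt so far. */
    int reportCount() {
        return reports.size();
    }
}
```

With the A->B->C->D example, the report from D would trigger the failover and 
be shown as the root cause at first. The reports from C and B would only be 
collected, and the late timeout of A, if it is classified as primary, would 
replace D as the detected root cause of the previous attempt.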

Also note that there is no pressure at the moment to fix this right away.

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task 
> update. 
> This is because CANCELED tasks are usually caused by another FAILED task. 
> These CANCELED tasks will be restarted by the failover process triggered by 
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime 
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to 
> CANCELED from any state except CANCELING as failed tasks. 
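
The rule proposed in the issue description could be sketched roughly as 
follows. This is only an illustration: CanceledTaskPolicy and its reduced State 
enum are hypothetical stand-ins, not Flink's scheduler code or its actual set 
of task states.

```java
/** Hypothetical sketch of the proposed rule; not Flink's scheduler code. */
final class CanceledTaskPolicy {

    /** Simplified stand-in for the task states mentioned in the issue. */
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    /**
     * Decides whether a state update should trigger failure handling.
     * FAILED always does. CANCELED does only when the task was not already
     * CANCELING, i.e. when the JM did not request the cancellation itself.
     */
    static boolean needsFailureHandling(State previous, State current) {
        if (current == State.FAILED) {
            return true;
        }
        return current == State.CANCELED && previous != State.CANCELING;
    }
}
```

Under this rule a RUNNING -> CANCELED update is handled like a failure, while 
CANCELING -> CANCELED is still ignored as before.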



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
