[ 
https://issues.apache.org/jira/browse/FLINK-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872933#comment-16872933
 ] 

zhijiang edited comment on FLINK-12889 at 6/26/19 5:33 AM:
-----------------------------------------------------------

My previous analysis on the ML was as follows:
  
 Actually, all five "Source: ServiceLog" tasks are not in a terminal state from 
the JM's view. The relevant process is shown below (a simplified code sketch 
follows the list): 
 * The checkpoint in the task causes an OOM issue, which calls 
`Task#failExternally` as a result; we can see the log "Attempting to fail 
task externally" in the TM.
 * The source task transitions its state from RUNNING to FAILED and then starts 
a canceler thread to cancel the task; we can see the log "Triggering cancellation 
of task" in the TM.
 * When the JM starts to cancel the source tasks, the RPC call 
`Task#cancelExecution` finds that the task is already in the FAILED state, as in 
step 2 above; we can see the log "Attempting to cancel task" in the TM.

 
 In the end, all five source tasks are not in terminal states according to the 
JM log. I guess step 2 might not have created the canceler thread successfully: 
the root failover was caused by an OOM while creating a native thread in step 1, 
so it is quite possible that creating the canceler thread also fails in this 
unstable OOM situation. If so, the source task would not be interrupted at all 
and would therefore never report back to the JM, even though its state had 
already changed to FAILED. 
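
 The specific failure point I suspect is the thread creation itself: 
`Thread#start` can throw `java.lang.OutOfMemoryError: unable to create new 
native thread`. A minimal, hypothetical illustration of how that would leave 
the task stuck:

```java
// Minimal, hypothetical illustration of the suspected failure point: starting the
// canceler thread itself fails with an OutOfMemoryError, so the task stays FAILED
// locally but is never interrupted and never reports a final state to the JM.
public class CancelerStartSketch {
    public static void main(String[] args) {
        Runnable canceler = () -> {
            // interrupt the task's invokable and report the final state to the JM
        };
        try {
            new Thread(canceler).start();   // can throw "unable to create new native thread"
        } catch (OutOfMemoryError oom) {
            // If nothing handles this, cancellation silently never happens and the
            // JM keeps waiting, which matches the stuck CANCELING state in this issue.
        }
    }
}
```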
  
 For the other vertex tasks, `Task#failExternally` is not triggered in step 1; 
they only receive the cancel RPC from the JM in step 3. I guess that by this 
time, later than when the sources failed, the canceler thread could be created 
successfully after some GCs, so these tasks could be canceled and reported to 
the JM side.
  
 I think the key problem is that in the OOM case some behaviors are not as 
expected, which might cause problems. Maybe we should handle the OOM error in an 
extreme way, like making the TM exit, to solve the potential issue.
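
 As a rough, hypothetical sketch of what "handle OOM in an extreme way" could 
look like (just one possible approach, not an existing Flink mechanism): install 
a handler that halts the process on OutOfMemoryError so the resource framework 
(e.g. YARN) can restart the TM. A real fix would also need to treat errors caught 
inside the failure path as fatal, not only those that escape to the top of a thread.

```java
// Rough, hypothetical sketch of treating OutOfMemoryError as fatal for the TM process;
// an assumption about one possible approach, not Flink's actual behavior.
public class FatalOomHandlerSketch {

    static void installFatalOomHandler() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                // Use halt() instead of exit(): shutdown hooks may themselves need
                // new threads or memory, which cannot be relied on after an OOM.
                Runtime.getRuntime().halt(1);
            }
        });
    }

    public static void main(String[] args) {
        installFatalOomHandler();
        // ... start the TaskManager services here ...
    }
}
```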
  
 [~till.rohrmann] do you think it is worth fixing, or do you have other concerns?



> Job keeps in FAILING state
> --------------------------
>
>                 Key: FLINK-12889
>                 URL: https://issues.apache.org/jira/browse/FLINK-12889
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Fan Xinpu
>            Priority: Minor
>         Attachments: 20190618104945417.jpg, jobmanager.log.2019-06-16.0, 
> taskmanager.log
>
>
> There is a topology of 3 operators: source, parser, and persist. 
> Occasionally, 5 subtasks of the source encounter an exception and turn to 
> FAILED; at the same time, one subtask of the parser runs into an exception and 
> turns to FAILED too. The jobmaster gets a message about the parser's failure. 
> The jobmaster then tries to cancel all the subtasks; most of the subtasks of 
> the three operators turn to CANCELED, except the 5 subtasks of the source, 
> because their state is already FAILED before the jobmaster tries to cancel 
> them. Then the jobmaster cannot reach a final state but stays in the FAILING 
> state, while the subtasks of the source stay in the CANCELING state. 
>  
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10 
> slots.
>  
> The attached files contain a JM log, a TM log, and the UI picture.
>  
> The exception timestamp is about 2019-06-16 13:42:28.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
