[ 
https://issues.apache.org/jira/browse/FLINK-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877368#comment-15877368
 ] 

ASF GitHub Bot commented on FLINK-5830:
---------------------------------------

Github user zhijiangW commented on the issue:

    https://github.com/apache/flink/pull/3360
  
    Hi @tillrohrmann , thank you for the review and the helpful suggestions!
    
    Let me explain the root cause of this issue first:
    
    On the JobMaster side, it sends the cancel RPC message and gets the 
acknowledgement from the TaskExecutor; the execution state then transitions to 
**CANCELING**.
    
    On the TaskExecutor side, it notifies the JobMaster of the final state before 
the task exits. **notifyFinalState** consists of two steps:
    
    - The Akka actor executes the **RunAsync** message. This is a tell action, 
and it triggers **unregisterTaskAndNotifyFinalState**.
    - Within **unregisterTaskAndNotifyFinalState**, the 
**updateTaskExecutionState** RPC message is triggered. This is an ask action, so 
the mechanism can avoid lost messages.
    
    The problem is that an OOM may occur before **updateTaskExecutionState** is 
triggered. The error is caught by **AkkaRpcActor**, which does nothing beyond 
logging it, so the following process is interrupted and 
**updateTaskExecutionState** is never executed.
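    To make the failure mode concrete, here is a minimal standalone Java sketch. The class and method names are hypothetical stand-ins, not Flink's actual implementation: a handler that catches `Throwable` and only logs it silently drops the follow-up RPC, while the actor keeps running.

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowedErrorDemo {
    static final List<String> rpcMessagesSent = new ArrayList<>();

    // Stand-in for unregisterTaskAndNotifyFinalState: the failure happens
    // before the updateTaskExecutionState RPC is ever triggered.
    static void unregisterTaskAndNotifyFinalState(boolean oom) {
        if (oom) {
            throw new OutOfMemoryError("simulated allocation failure");
        }
        rpcMessagesSent.add("updateTaskExecutionState");
    }

    // Stand-in for the actor's RunAsync handling: catching Throwable and
    // merely logging it swallows the error, so the caller never learns
    // that the follow-up RPC was skipped.
    static void handleRunAsync(Runnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            System.out.println("caught and ignored: " + t);
        }
    }

    public static void main(String[] args) {
        handleRunAsync(() -> unregisterTaskAndNotifyFinalState(true));
        // The final state was never sent, yet nothing fails visibly:
        System.out.println("final-state RPCs sent: " + rpcMessagesSent.size());
    }
}
```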
    
    For key interactions between the TaskExecutor and the JobMaster, lost 
messages should not be tolerated, and I agree with your suggestions above. There 
are two possible ideas for this improvement:
    
    - Enhance the robustness of **notifyFinalState**. Rethrowing the OOM, as 
currently proposed, is an easy option, but it causes the TaskExecutor to exit; 
there should be other ways to reduce that cost.
    - After the JobMaster gets the cancel acknowledgement, it triggers a timeout 
to check the execution's final state. If the execution has not entered a final 
state within the timeout, the JobMaster can resend the message to the 
TaskExecutor to confirm the status.
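    A minimal sketch of the first idea, with hypothetical names rather than Flink's actual classes: distinguish fatal errors such as OOM from ordinary exceptions, and rethrow only the fatal ones so that a process-level fatal handler can terminate the TaskExecutor instead of the error being silently logged.

```java
public class FatalErrorHandling {
    // Rethrow errors the process cannot safely recover from; anything else
    // is logged and the actor keeps serving messages.
    static void handleSafely(Runnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            if (t instanceof VirtualMachineError) { // includes OutOfMemoryError
                throw (VirtualMachineError) t;      // escalate to the fatal handler
            }
            System.out.println("non-fatal, logged: " + t);
        }
    }

    public static void main(String[] args) {
        // A recoverable exception is swallowed and the loop continues:
        handleSafely(() -> { throw new IllegalStateException("recoverable"); });
        // A fatal error escapes the handler and can terminate the process:
        try {
            handleSafely(() -> { throw new OutOfMemoryError("fatal"); });
        } catch (OutOfMemoryError e) {
            System.out.println("fatal error escaped the handler: " + e.getMessage());
        }
    }
}
```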
    
    I prefer the first way, since it solves the problem on one side only and 
avoids complex interaction between the TaskExecutor and the JobMaster.
    
    Looking forward to your further suggestions or ideas.


> OutOfMemoryError during notify final state in TaskExecutor may cause job stuck
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-5830
>                 URL: https://issues.apache.org/jira/browse/FLINK-5830
>             Project: Flink
>          Issue Type: Bug
>            Reporter: zhijiang
>            Assignee: zhijiang
>
> The scenario is like this:
> {{JobMaster}} tries to cancel all the executions when processing a failed 
> execution, and the task executor has already acknowledged the cancel RPC message.
> When notifying the final state in the {{TaskExecutor}}, an OOM occurs in the 
> {{AkkaRpcActor}}, and this error is only caught and logged. The final state 
> will not be sent any more.
> The {{JobMaster}} therefore cannot receive the final state and trigger the 
> restart strategy.
> One solution is to catch the {{OutOfMemoryError}} and rethrow it, which will 
> shut down the {{ActorSystem}} and cause the {{TaskExecutor}} to exit. The 
> {{JobMaster}} can then be notified of the {{TaskExecutor}} failure and fail 
> all the tasks to trigger a restart successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
