Github user zhijiangW commented on the issue:

    https://github.com/apache/flink/pull/3360
  
    Hi @tillrohrmann , thank you for reviews and positive suggestions!
    
    I try to explain the root case of this issue first:
    
    From JobMaster side, it sends the cancel rpc message and gets the 
acknowledge from TaskExecutor, then the execution state transition to 
**CANCELING**.
    
    From TaskExecutor side, it would notify the final state to JobMaster before 
task exits. The **notifyFinalState** can be divided into two steps:
    
    - Execute the **RunAsync** message by akka actor and this is a tell action, 
and it will trigger to run **unregisterTaskAndNotifyFinalState**.
    - In process of **unregisterTaskAndNotifyFinalState**, it will trigger the 
rpc message of **updateTaskExecutionState** , and it is a ask action, so the 
mechanism can avoid lost message.
    
    The problem is that it may cause OOM before trigger 
**updateTaskExecutionState**, and this error is caught by **AkkaRpcActor** and 
does not do anything resulting in interrupting the following process. The 
**updateTaskExecutionState** will not be executed anymore.
    
    For the key point interaction between TaskExecutor and JobMaster, it should 
not tolerate lost message, and I agree with your above suggestions. So there 
may be two ideas for this improvement:
    
    - Enhance the robustness of **notifyFinalState**, and the current rethrow 
OOM is an easy option, but it will cause the TaskExecutor exit,there should 
be other ways to make the cost reduction.
    - After get cancel acknowledge in JobMaster side, it will trigger a timeout 
to check the execution final state. If the execution has not entered the final 
state within timeout, the JobMaster can resend the acknowledge message to 
TaskExecutor to confirm the status.
    
    And I prefers the first way to just make it sense in one side, avoid the 
complex interaction between TaskExecutor and JobMaster. 
    
    Wish your further suggestions or any ideas.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to