[ 
https://issues.apache.org/jira/browse/FLINK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-10439.
---------------------------------
    Resolution: Duplicate

> Race condition during job suspension
> ------------------------------------
>
>                 Key: FLINK-10439
>                 URL: https://issues.apache.org/jira/browse/FLINK-10439
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.0
>            Reporter: Ufuk Celebi
>            Priority: Major
>         Attachments: master-logs.log, race-job-suspension.png, worker-logs.log
>
>
> When a {{JobMaster}} in an HA setup loses leadership, it suspends the 
> execution of its job via {{JobMaster.suspend(Exception, Time)}}. This 
> operation involves transitioning to the {{SUSPENDING}} job state and 
> cancelling all running tasks. In some executions it may happen that the job 
> does *not* reach the terminal {{SUSPENDED}} job state.
> This happens because suspending the job stops the related RPC endpoints, 
> such as the {{JobMaster}} or {{SlotPool}} (in {{JobMaster.suspend(Exception, 
> Time)}} and {{JobMaster.suspendExecution(Exception)}}), immediately after 
> suspending. Whenever this happens *before* the {{TaskExecutor}} instances 
> have cancelled or failed the respective tasks, the job does not transition to 
> {{SUSPENDED}}, because the {{ExecutionGraph}} does not receive all 
> {{Execution}} state transitions.
> In practice, this should not happen frequently because 
> {{JobMaster}} and {{TaskExecutor}} instances are notified about the loss of 
> leadership (or loss of ZooKeeper connection or similar events) around the 
> same time. In this scenario, the {{TaskExecutor}} instances proactively fail 
> the executing tasks and notify the {{JobMaster}}. All in all, the impact of 
> this is limited by the fact that a new {{JobMaster}} leader will eventually 
> recover the job.
> *Steps to reproduce*:
> - Start ZooKeeper
> - Start a Flink cluster in HA mode and submit a job
> - Stop ZooKeeper
> In some executions you will find that the job does not reach the terminal 
> state {{SUSPENDED}}. Furthermore, you may see log messages similar to the 
> following in this case:
> {code}
> The rpc endpoint org.apache.flink.runtime.jobmaster.slotpool.SlotPool has not 
> been started yet. Discarding message 
> org.apache.flink.runtime.rpc.messages.LocalRpcInvocation until processing is 
> started.
> {code}
> I've attached logs of a local run that does not transition to {{SUSPENDED}} 
> and a sequence diagram of what I think may be a problematic timing.
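
The timing described in the issue can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions: ToyJobMaster, onTaskCancelled and stopEndpoint are hypothetical names invented for illustration and are not Flink classes or methods. The model assumes that a stopped endpoint silently discards incoming messages, which matches the "Discarding message" log line quoted above in spirit but does not reproduce Flink's actual RPC code.

```java
public class SuspendRaceSketch {

    /** Simplified job states; the real Flink JobStatus enum has more values. */
    enum JobState { RUNNING, SUSPENDING, SUSPENDED }

    /** Hypothetical stand-in for an RPC endpoint that owns the job state. */
    static class ToyJobMaster {
        JobState state = JobState.RUNNING;
        int runningTasks;
        int discardedMessages = 0;
        boolean endpointRunning = true;

        ToyJobMaster(int runningTasks) {
            this.runningTasks = runningTasks;
        }

        /** Start suspending: the job waits for all tasks to report back. */
        void suspend() {
            state = JobState.SUSPENDING;
        }

        /** Stop the endpoint: later messages are dropped (assumption). */
        void stopEndpoint() {
            endpointRunning = false;
        }

        /** Message from a worker: one task has been cancelled or has failed. */
        void onTaskCancelled() {
            if (!endpointRunning) {
                discardedMessages++; // the state transition is never seen
                return;
            }
            runningTasks--;
            if (state == JobState.SUSPENDING && runningTasks == 0) {
                state = JobState.SUSPENDED;
            }
        }
    }

    /** Runs one suspension with two tasks and returns the final job state. */
    static JobState run(boolean stopEndpointBeforeAcks) {
        ToyJobMaster jobMaster = new ToyJobMaster(2);
        jobMaster.suspend();
        if (stopEndpointBeforeAcks) {
            jobMaster.stopEndpoint(); // problematic ordering from the issue
        }
        jobMaster.onTaskCancelled();
        jobMaster.onTaskCancelled();
        jobMaster.stopEndpoint();
        return jobMaster.state;
    }

    public static void main(String[] args) {
        System.out.println("acks before stop: " + run(false)); // SUSPENDED
        System.out.println("stop before acks: " + run(true));  // SUSPENDING
    }
}
```

In this model the job reaches SUSPENDED only when both task notifications are processed before the endpoint stops. With the other ordering the notifications are discarded and the job stays in SUSPENDING, which is the symptom reported here.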



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
