[ 
https://issues.apache.org/jira/browse/FLINK-11537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-11537:
----------------------------------
    Description: 
The {{ExecutionGraph}} sometimes does not reach a terminal state if the 
{{JobMaster}} loses leadership. The reason is that we use the fenced main 
thread executor to execute {{ExecutionGraph}} changes, and we don't wait for the 
{{ExecutionGraph}} to reach the terminal state before we set the fencing token 
to {{null}}.

One possible solution would be to wait for the {{ExecutionGraph}} to reach the 
terminal state before clearing the fencing token. This has, however, the 
downside that the {{JobMaster}} is still reachable until the {{ExecutionGraph}} 
has been properly terminated. Alternatively, we could use the unfenced main 
thread executor to send the cancel calls out.
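
To make the race and the two proposed fixes concrete, here is a minimal, 
single-threaded sketch. It is a toy model, not Flink's actual implementation: 
{{FencingSketch}}, {{MainThread}}, {{Strategy}} and {{simulate}} are 
hypothetical names introduced only for illustration, and the cancel 
acknowledgement is reduced to a single queued state change. The assumption 
modelled here is that a fenced action is silently dropped if the fencing token 
has changed by the time the action runs.

```java
import java.util.ArrayDeque;
import java.util.Objects;
import java.util.Queue;
import java.util.UUID;

/** Hypothetical toy model of fenced execution; these are not Flink's classes. */
public class FencingSketch {

    enum State { RUNNING, CANCELLING, CANCELED }

    enum Strategy { CLEAR_TOKEN_FIRST, WAIT_FOR_TERMINAL_STATE, UNFENCED_CANCEL }

    /** Toy main thread: queued actions only run when drain() is called. */
    static final class MainThread {
        private final Queue<Runnable> queue = new ArrayDeque<>();
        private UUID fencingToken;

        MainThread(UUID fencingToken) {
            this.fencingToken = fencingToken;
        }

        void setFencingToken(UUID fencingToken) {
            this.fencingToken = fencingToken;
        }

        /** Fenced: the action is skipped if the token changed before it runs. */
        void executeFenced(Runnable action) {
            final UUID expected = fencingToken;
            queue.add(() -> {
                if (Objects.equals(expected, fencingToken)) {
                    action.run();
                }
            });
        }

        /** Unfenced: the action always runs, regardless of the token. */
        void executeUnfenced(Runnable action) {
            queue.add(action);
        }

        void drain() {
            Runnable next;
            while ((next = queue.poll()) != null) {
                next.run();
            }
        }
    }

    /** Simulates losing leadership and returns the final graph state. */
    static State simulate(Strategy strategy) {
        MainThread mainThread = new MainThread(UUID.randomUUID());
        State[] state = {State.RUNNING};

        // Leadership is lost: cancellation starts, the acknowledgement that
        // moves the graph to its terminal state is still queued.
        state[0] = State.CANCELLING;
        Runnable cancelAck = () -> state[0] = State.CANCELED;
        if (strategy == Strategy.UNFENCED_CANCEL) {
            mainThread.executeUnfenced(cancelAck);
        } else {
            mainThread.executeFenced(cancelAck);
        }

        // Option 1: let pending changes run before the token is cleared.
        if (strategy == Strategy.WAIT_FOR_TERMINAL_STATE) {
            mainThread.drain();
        }

        mainThread.setFencingToken(null); // leadership revoked
        mainThread.drain();
        return state[0];
    }

    public static void main(String[] args) {
        for (Strategy strategy : Strategy.values()) {
            System.out.println(strategy + " -> " + simulate(strategy));
        }
    }
}
```

In this model, {{CLEAR_TOKEN_FIRST}} leaves the graph stuck in 
{{CANCELLING}}, while both {{WAIT_FOR_TERMINAL_STATE}} and 
{{UNFENCED_CANCEL}} end in {{CANCELED}}. The sketch does not capture the 
trade-off mentioned above, namely that waiting keeps the {{JobMaster}} 
reachable under its old token until the graph has terminated.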

A Travis run where the problem occurred is here: 
https://travis-ci.org/tillrohrmann/flink/jobs/489119926

  was:
The {{ExecutionGraph}} sometimes does not reach a terminal state if the 
{{JobMaster}} lost the leadership. The reason is that we use the fenced main 
thread executor to execute {{ExecutionGraph}} changes and we don't wait for the 
{{ExecutionGraph}} to reach the terminal state before we set the fencing token 
{{null}}.

One possible solution would be to wait for the {{ExecutionGraph}} to reach the 
terminal state before clearing the fencing token. This has, however, the 
downside that the {{JobMaster}} is still reachable until the {{ExecutionGraph}} 
has been properly terminated. Alternatively, we could use the unfenced main 
thread executor to send the cancel calls out.


> ExecutionGraph does not reach terminal state when JobMaster lost leadership
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-11537
>                 URL: https://issues.apache.org/jira/browse/FLINK-11537
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.8.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.8.0
>
>
> The {{ExecutionGraph}} sometimes does not reach a terminal state if the 
> {{JobMaster}} loses leadership. The reason is that we use the fenced main 
> thread executor to execute {{ExecutionGraph}} changes, and we don't wait for 
> the {{ExecutionGraph}} to reach the terminal state before we set the fencing 
> token to {{null}}.
> One possible solution would be to wait for the {{ExecutionGraph}} to reach 
> the terminal state before clearing the fencing token. This has, however, the 
> downside that the {{JobMaster}} is still reachable until the 
> {{ExecutionGraph}} has been properly terminated. Alternatively, we could use 
> the unfenced main thread executor to send the cancel calls out.
> A Travis run where the problem occurred is here: 
> https://travis-ci.org/tillrohrmann/flink/jobs/489119926



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
