[ 
https://issues.apache.org/jira/browse/FLINK-18677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168001#comment-17168001
 ] 

Matthias commented on FLINK-18677:
----------------------------------

We were able to reproduce the behavior by submitting a long-running task (e.g. 
the WindowJoin example). After killing the StandaloneSession daemon and, 
shortly afterwards, stopping ZooKeeper, we ended up in a situation where the 
TaskManager was not informed by the JobManager that the ZooKeeper connection 
had been suspended.

The task kept processing data even though the TaskManager itself recognized 
(and logged) the suspended ZooKeeper connection.

> ZooKeeperLeaderRetrievalService does not invalidate leader in case of 
> SUSPENDED connection
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18677
>                 URL: https://issues.apache.org/jira/browse/FLINK-18677
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.12.0
>
>
> The {{ZooKeeperLeaderRetrievalService}} does not invalidate the leader if the 
> ZooKeeper connection gets SUSPENDED. This means that a {{TaskManager}} won't 
> cancel its running tasks even though it might miss a leader change. I think 
> we should at least make it configurable whether in such a situation the 
> leader listener should be informed about the lost leadership. Otherwise, we 
> might run into the situation where an old and a newly recovered instance of a 
> {{Task}} can run at the same time.
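
The configurable behavior proposed in the description could look roughly like 
the sketch below. It is a self-contained illustration, not Flink's actual 
code: ConnectionState, LeaderListener, RetrievalService and the 
invalidateOnSuspended flag are hypothetical names chosen for this example, 
and a null leader address is assumed to mean "no known leader".

```java
import java.util.ArrayList;
import java.util.List;

public class SuspendedLeaderSketch {

    /** Hypothetical stand-in for the ZooKeeper client's connection states. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical listener; a null address is assumed to mean "no known leader". */
    interface LeaderListener {
        void notifyLeaderAddress(String leaderAddress);
    }

    /** Hypothetical retrieval service with a switch for the SUSPENDED case. */
    static final class RetrievalService {
        private final LeaderListener listener;
        private final boolean invalidateOnSuspended;
        private String lastLeaderAddress;

        RetrievalService(LeaderListener listener, boolean invalidateOnSuspended) {
            this.listener = listener;
            this.invalidateOnSuspended = invalidateOnSuspended;
        }

        /** Called when a new leader address has been read. */
        void onLeaderChanged(String leaderAddress) {
            lastLeaderAddress = leaderAddress;
            listener.notifyLeaderAddress(leaderAddress);
        }

        /** Called when the connection state changes. */
        void onConnectionStateChanged(ConnectionState newState) {
            boolean leaderMayBeStale =
                    newState == ConnectionState.LOST
                            || (newState == ConnectionState.SUSPENDED && invalidateOnSuspended);
            if (leaderMayBeStale && lastLeaderAddress != null) {
                // Tell the listener that the leader is unknown so it can react,
                // for example by cancelling its running tasks.
                lastLeaderAddress = null;
                listener.notifyLeaderAddress(null);
            }
        }
    }

    /** Records every notification the listener receives for one scenario. */
    public static List<String> simulate(boolean invalidateOnSuspended) {
        List<String> notifications = new ArrayList<>();
        RetrievalService service =
                new RetrievalService(notifications::add, invalidateOnSuspended);
        service.onLeaderChanged("jobmanager-1");
        service.onConnectionStateChanged(ConnectionState.SUSPENDED);
        return notifications;
    }

    public static void main(String[] args) {
        System.out.println(simulate(false)); // prints [jobmanager-1]
        System.out.println(simulate(true));  // prints [jobmanager-1, null]
    }
}
```

With the flag disabled, the listener never hears about the suspended 
connection, which matches the behavior reported in this issue. With the flag 
enabled, the listener receives a null address and can act on the possibly 
lost leadership.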



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
