Arun Lakshman created FLINK-37811:
-------------------------------------

             Summary: Flink Job stuck in suspend state after losing leadership 
in Zookeeper HA
                 Key: FLINK-37811
                 URL: https://issues.apache.org/jira/browse/FLINK-37811
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.20.0, 1.15.0
            Reporter: Arun Lakshman
         Attachments: notRecovered.csv

We have observed inconsistent behavior in which the JobManager encounters a 
ZooKeeper session timeout and, as a result, loses leadership across multiple 
components, including the Resource Manager, Job Master, and Dispatcher. What 
follows is unexpected: while those components are still shutting down, the 
ZooKeeper connection state changes to RECONNECTED, yet the jobs still enter 
the SUSPENDED state and remain stuck there. Notably, the JobManager process 
keeps running and does not perform a system exit. The initial trigger appears 
as a session timeout exception with the message "Client session timed out, 
have not heard from server in 26678ms".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
