Arun Lakshman created FLINK-37811:
-------------------------------------

             Summary: Flink Job stuck in suspend state after losing leadership in Zookeeper HA
                 Key: FLINK-37811
                 URL: https://issues.apache.org/jira/browse/FLINK-37811
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.20.0, 1.15.0
            Reporter: Arun Lakshman
         Attachments: notRecovered.csv

We have observed inconsistent behavior in which the JobManager hits a ZooKeeper session timeout and, as a result, several components lose leadership, including the Resource Manager, Job Master, and Dispatcher. What follows is an unexpected sequence: while these components are still shutting down, the ZooKeeper connection is RECONNECTED, yet the jobs still enter the SUSPENDED state. Notably, the JobManager process keeps running and does not perform a system exit.

The initial trigger shows up as a session timeout exception with the message "Client session timed out, have not heard from server in 26678ms".
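
As a rough aid for reading the timeout message, here is a minimal sketch in Python. It assumes the stock ZooKeeper Java client, which raises this error once it has heard nothing from the server for about two thirds of the negotiated session timeout. Under that assumption, the idle time in the message gives an approximate upper bound on the session timeout that was in effect. The function names are hypothetical and not part of Flink or ZooKeeper, and the result is an estimate, not a value read from the cluster configuration.

```python
import re
from typing import Optional

# Matches the idle time in the ZooKeeper client timeout message quoted in this report.
TIMEOUT_RE = re.compile(r"have not heard from server in (\d+)ms")


def idle_ms(message: str) -> Optional[int]:
    """Return the idle time in milliseconds from a timeout message, or None if absent."""
    match = TIMEOUT_RE.search(message)
    return int(match.group(1)) if match else None


def approx_max_session_timeout_ms(idle: int) -> int:
    """Estimate an upper bound on the negotiated session timeout.

    Assumption: the client gives up once idle time reaches 2/3 of the
    negotiated session timeout, so session timeout <= idle * 3 / 2.
    """
    return idle * 3 // 2


message = "Client session timed out, have not heard from server in 26678ms"
idle = idle_ms(message)
if idle is not None:
    print(idle, approx_max_session_timeout_ms(idle))
```

If the estimate is far from the session timeout the cluster is expected to use, that may be worth noting alongside the attached logs.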