This might be related to FLINK-10052[1]. Unfortunately, there has been no progress on this ticket.
cc @Till Rohrmann <trohrm...@apache.org>

Best,
Yang

chenqin <qinnc...@gmail.com> wrote on Tue, Apr 13, 2021 at 7:31 AM:

> Hi there,
>
> We observed several jobs running on 1.11 restart due to job leader
> loss. Digging deeper, the issue seems related to the SUSPENDED state
> handler in ZooKeeperLeaderRetrievalService.
>
> AFAIK, the suspended state is expected when ZooKeeper is not certain
> whether the leader is still alive. It can be followed by RECONNECTED or
> LOST. In the current implementation [1], we treat the suspended state the
> same as the lost state and actively shut down the job. This poses a
> stability issue in large HA setups.
>
> My question is: can we get some insight into the reasoning behind this
> decision, and could we add a tunable configuration that lets users decide
> how long they can tolerate such an uncertain suspended state in their
> jobs?
>
> Thanks,
> Chen
>
> [1]
> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
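
To make the request in the quoted message a bit more concrete, here is a minimal sketch of what a tunable grace period for the suspended state could look like. It is not Flink or Curator code: the class `SuspendedGracePeriod`, the enum `ConnectionEvent`, and the `graceMillis` parameter are all hypothetical names made up for illustration, and timestamps are passed in explicitly so the logic is easy to follow. The idea is only that SUSPENDED starts a timer, RECONNECTED cancels it, and the leader is treated as lost either on LOST or once the timer exceeds the configured tolerance.

```java
/**
 * Hypothetical sketch, not part of Flink or Curator.
 * Tracks how long a ZooKeeper connection has been suspended and decides
 * whether the leader should be treated as lost.
 */
class SuspendedGracePeriod {

    /** Stand-in for the three connection events discussed in this thread. */
    enum ConnectionEvent { SUSPENDED, RECONNECTED, LOST }

    /** Assumed user-tunable tolerance for the suspended state, in milliseconds. */
    private final long graceMillis;

    /** Timestamp of the first unresolved SUSPENDED event, or -1 if not suspended. */
    private long suspendedSince = -1L;

    /** Set once the connection is definitively reported as LOST. */
    private boolean lost = false;

    SuspendedGracePeriod(long graceMillis) {
        if (graceMillis < 0) {
            throw new IllegalArgumentException("graceMillis must be >= 0");
        }
        this.graceMillis = graceMillis;
    }

    /** Records a connection event observed at the given time. */
    void onEvent(ConnectionEvent event, long nowMillis) {
        switch (event) {
            case SUSPENDED:
                // Keep the earliest timestamp if SUSPENDED is reported repeatedly.
                if (suspendedSince < 0) {
                    suspendedSince = nowMillis;
                }
                break;
            case RECONNECTED:
                // The uncertainty is resolved, so reset all state.
                suspendedSince = -1L;
                lost = false;
                break;
            case LOST:
                lost = true;
                break;
        }
    }

    /**
     * Returns true if the leader should be treated as lost: either LOST was
     * reported, or the connection has been suspended for at least graceMillis.
     */
    boolean shouldTreatLeaderAsLost(long nowMillis) {
        if (lost) {
            return true;
        }
        return suspendedSince >= 0 && nowMillis - suspendedSince >= graceMillis;
    }
}
```

With `graceMillis` set to 0 this sketch behaves like the handling described above, where SUSPENDED is treated the same as LOST. A larger value would let a job ride out short periods of uncertainty, at the cost of reacting later when the leader really is gone.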