Hi Chenqin,

The current rationale behind assuming a leadership loss when seeing a SUSPENDED connection is to assume the worst and to be on the safe side.
Yang Wang is correct. FLINK-10052 [1] aims to make this behaviour configurable. Unfortunately, the community has not had enough time to complete this feature.

[1] https://issues.apache.org/jira/browse/FLINK-10052

Cheers,
Till

On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:

> This might be related to FLINK-10052 [1].
> Unfortunately, we have not made any progress on this ticket.
>
> cc @Till Rohrmann <trohrm...@apache.org>
>
> Best,
> Yang
>
> On Tue, Apr 13, 2021 at 7:31 AM chenqin <qinnc...@gmail.com> wrote:
>
>> Hi there,
>>
>> We observed several jobs running on 1.11 restart due to a lost job
>> leader. Digging deeper, the issue seems related to the SUSPENDED state
>> handler in ZooKeeperLeaderRetrievalService.
>>
>> AFAIK, the suspended state is expected when ZK is not certain whether the
>> leader is still alive. It can be followed by RECONNECTED or LOST. In the
>> current implementation [1], we treat the suspended state the same as the
>> lost state and actively shut down the job. This poses a stability issue
>> in large HA setups.
>>
>> My question is: can we get some insight into the reasoning behind this
>> decision, and could we add a tunable configuration that lets users decide
>> how long they can tolerate such an uncertain suspended state in their
>> jobs?
>>
>> Thanks,
>> Chen
>>
>> [1]
>> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>>
>> --
>> Sent from:
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
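
For readers of the archive, the tunable behaviour asked about in this thread could look roughly like the sketch below. It is only an illustration of the idea, not Flink's code and not what FLINK-10052 specifies. The class `SuspendedGracePeriod`, its `graceMillis` parameter, the method names, and the local `ConnectionState` enum are all hypothetical stand-ins; the enum merely mirrors the states mentioned above and is not the Curator class. Time is passed in explicitly to keep the example deterministic.

```java
/** Hypothetical stand-in for the connection states discussed in this thread. */
enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/**
 * Sketch of a grace period for SUSPENDED connections (hypothetical, not Flink code).
 * A grace period of 0 reproduces the "assume the worst" behaviour described above:
 * SUSPENDED is treated like LOST right away.
 */
class SuspendedGracePeriod {

    private final long graceMillis;
    private long suspendedSince = -1L; // -1 means "not currently suspended"
    private boolean lost = false;

    SuspendedGracePeriod(long graceMillis) {
        if (graceMillis < 0) {
            throw new IllegalArgumentException("graceMillis must be >= 0");
        }
        this.graceMillis = graceMillis;
    }

    /** Records a connection state change observed at the given time. */
    void onStateChange(ConnectionState state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Remember only the start of the current suspension.
                if (suspendedSince < 0) {
                    suspendedSince = nowMillis;
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is healthy again: forget the suspension.
                suspendedSince = -1L;
                lost = false;
                break;
            case LOST:
                lost = true;
                break;
            default:
                break;
        }
    }

    /** True if the leader should be treated as lost at the given time. */
    boolean leaderConsideredLost(long nowMillis) {
        if (lost) {
            return true;
        }
        return suspendedSince >= 0 && nowMillis - suspendedSince >= graceMillis;
    }

    public static void main(String[] args) {
        SuspendedGracePeriod grace = new SuspendedGracePeriod(5_000);

        grace.onStateChange(ConnectionState.SUSPENDED, 0);
        System.out.println(grace.leaderConsideredLost(1_000));  // false: still within grace

        grace.onStateChange(ConnectionState.RECONNECTED, 2_000);
        System.out.println(grace.leaderConsideredLost(10_000)); // false: reconnected in time

        grace.onStateChange(ConnectionState.SUSPENDED, 10_000);
        System.out.println(grace.leaderConsideredLost(15_000)); // true: grace period elapsed
    }
}
```

With this shape, the safe default stays available (a grace period of 0), while large HA setups could choose to ride out short suspensions. The trade-off is that a longer grace period delays the reaction to a real leadership loss.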