Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Yang Wang Mon, 12 Apr 2021 23:26:36 -0700

This might be related to FLINK-10052[1].
Unfortunately, there has been no progress on that ticket so far.


cc @Till Rohrmann <[email protected]>

Best,
Yang

On Tue, Apr 13, 2021 at 7:31 AM, chenqin <[email protected]> wrote:

> Hi there,
>
> We observed several jobs running on Flink 1.11 restart because the job
> leader was lost.
> Digging deeper, the issue seems to be related to the SUSPENDED state
> handling in ZooKeeperLeaderRetrievalService.
>
> AFAIK, the SUSPENDED state is expected when ZooKeeper is not certain
> whether the leader is still alive. It can be followed by RECONNECTED or
> LOST. In the current implementation [1], we treat the SUSPENDED state the
> same as the LOST state and actively shut down the job. This poses a
> stability issue in large HA setups.
>
> My question is: can we get some insight into the reasoning behind this
> decision, and could we add a tunable configuration option that lets users
> decide how long their jobs can tolerate such an uncertain SUSPENDED state?
>
> Thanks,
> Chen
>
> [1]
>
> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>
>
>
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>
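
For illustration only, the tunable tolerance asked about above could look
roughly like the sketch below. It is not Flink code. The class name
SuspendedGracePeriod, the graceMillis parameter and the tick() method are all
hypothetical, and the ConnState enum only mirrors the state names discussed in
this thread. The idea is that SUSPENDED starts a timer instead of triggering
the leader-loss path right away; RECONNECTED cancels the timer, while LOST or
an expired timer reports the loss once.

```java
/** Hypothetical sketch, not Flink code: tolerate SUSPENDED for a grace period. */
final class SuspendedGracePeriod {

    /** Mirrors the connection state names discussed in this thread. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final long graceMillis;          // hypothetical tunable, e.g. read from configuration
    private final Runnable notifyLeaderLoss; // action taken when the leader is considered lost
    private long suspendedSince = -1L;       // -1 means "not currently suspended"
    private boolean lossNotified = false;    // avoids reporting the same loss twice

    SuspendedGracePeriod(long graceMillis, Runnable notifyLeaderLoss) {
        this.graceMillis = graceMillis;
        this.notifyLeaderLoss = notifyLeaderLoss;
    }

    /** Feed every connection state change here, with the current time in milliseconds. */
    synchronized void onStateChanged(ConnState state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Start the grace period, but do not restart it on repeated SUSPENDED events.
                if (suspendedSince < 0) {
                    suspendedSince = nowMillis;
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is back: forget the suspension and allow future notifications.
                suspendedSince = -1L;
                lossNotified = false;
                break;
            case LOST:
                // The session is gone: report immediately, regardless of the grace period.
                suspendedSince = -1L;
                notifyOnce();
                break;
        }
    }

    /** Call periodically (for example from a timer) to enforce the grace period. */
    synchronized void tick(long nowMillis) {
        if (suspendedSince >= 0 && nowMillis - suspendedSince >= graceMillis) {
            suspendedSince = -1L;
            notifyOnce();
        }
    }

    private void notifyOnce() {
        if (!lossNotified) {
            lossNotified = true;
            notifyLeaderLoss.run();
        }
    }
}
```

Setting graceMillis to 0 makes the first tick() after SUSPENDED report the
loss, which is close to the behaviour described above; a larger value trades
slower failure detection for fewer restarts during short ZooKeeper hiccups.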
