Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Till Rohrmann Tue, 13 Apr 2021 01:35:20 -0700

Hi Chenqin,

The current rationale behind assuming a leadership loss when seeing a
SUSPENDED connection is to assume the worst and to be on the safe side.


Yang Wang is correct. FLINK-10052 [1] aims to make this behaviour
configurable. Unfortunately, the community has not yet had enough time to
complete this feature.
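
To make the idea of a configurable tolerance concrete, here is a minimal
sketch of a grace period for SUSPENDED. It is only an illustration, not the
FLINK-10052 design: the class SuspendedGracePolicy, its methods, and the
graceMillis parameter are hypothetical and do not exist in Flink or Curator.
The enum only loosely mirrors the connection states discussed in this thread.

```java
// Hypothetical sketch; none of these names are Flink or Curator APIs.
final class SuspendedGracePolicy {

    /** Connection states as discussed in this thread (illustrative only). */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Assumed tunable: how long SUSPENDED is tolerated before assuming leader loss.
    private final long graceMillis;
    // Time at which the connection became SUSPENDED, or -1 if it is not suspended.
    private long suspendedSinceMillis = -1L;
    // Set once the connection has been reported as LOST.
    private boolean connectionLost = false;

    SuspendedGracePolicy(long graceMillis) {
        if (graceMillis < 0) {
            throw new IllegalArgumentException("graceMillis must be >= 0");
        }
        this.graceMillis = graceMillis;
    }

    /** Records a connection state change observed at the given time. */
    void onStateChange(ConnectionState state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Keep the first suspension timestamp if already suspended.
                if (suspendedSinceMillis < 0) {
                    suspendedSinceMillis = nowMillis;
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is back; clear any pending suspension or loss.
                suspendedSinceMillis = -1L;
                connectionLost = false;
                break;
            case LOST:
                suspendedSinceMillis = -1L;
                connectionLost = true;
                break;
        }
    }

    /** Returns true if the leader should be treated as lost at the given time. */
    boolean isLeaderConsideredLost(long nowMillis) {
        if (connectionLost) {
            return true;
        }
        return suspendedSinceMillis >= 0
                && nowMillis - suspendedSinceMillis >= graceMillis;
    }
}
```

With graceMillis set to 0 this behaves like the current "assume the worst"
handling, since SUSPENDED immediately counts as a leadership loss. A larger
value trades safety for fewer restarts during short ZooKeeper hiccups.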

[1] https://issues.apache.org/jira/browse/FLINK-10052

Cheers,
Till

On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <[email protected]> wrote:

> This might be related to FLINK-10052 [1].
> Unfortunately, there has been no progress on this ticket.
>
> cc @Till Rohrmann <[email protected]>
>
> Best,
> Yang
>
> chenqin <[email protected]> wrote on Tue, Apr 13, 2021 at 7:31 AM:
>
>> Hi there,
>>
>> We observed several jobs running on Flink 1.11 restart because the job
>> leader was lost.
>> Digging deeper, the issue seems related to the SUSPENDED state handler in
>> ZooKeeperLeaderRetrievalService.
>>
>> AFAIK, the SUSPENDED state is expected when ZK is not certain whether the
>> leader is still alive. It can be followed by RECONNECTED or LOST. In the
>> current implementation [1], we treat the SUSPENDED state the same as the
>> LOST state and actively shut down the job. This poses a stability issue in
>> large HA setups.
>>
>> My question is: can we get some insight into the reasoning behind this
>> decision, and could we add a tunable configuration option that lets users
>> decide how long their jobs can tolerate such an uncertain SUSPENDED state?
>>
>> Thanks,
>> Chen
>>
>> [1]
>>
>> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>>
>
