[ 
https://issues.apache.org/jira/browse/FLINK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn Visser updated FLINK-27396:
-----------------------------------
    Fix Version/s:     (was: 1.16.0)

> Reduce the Heartbeat timeout after zookeeper suspended
> ------------------------------------------------------
>
>                 Key: FLINK-27396
>                 URL: https://issues.apache.org/jira/browse/FLINK-27396
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Priority: Major
>
> After FLINK-10052, Flink tolerates ZooKeeper (zk) connection suspension if 
> `high-availability.zookeeper.client.tolerate-suspended-connections` is 
> enabled. This feature is very useful: it avoids unnecessary Flink job 
> failovers when some zk server nodes crash or during a zk rolling restart.
> Two cases result in zk SUSPENDED:
>  * The zk server to which the TM/JM is connected is stopped
>  * TM has a network partition.
> For the first case, we hope Flink can tolerate it. For the second case, we 
> want the TM to fail fast, because the JM may have already started a new TM, 
> and if the partitioned TM does not fail, it may process duplicate data 
> (network partitioning is complicated). But in the second case, the TM keeps 
> running until the zk connection is lost 
> (`high-availability.zookeeper.client.session-timeout`, default 60s) or the 
> heartbeat with the JM times out (`heartbeat.timeout`, default 50s).
> Can we set `heartbeat.timeout` to 20s while zk is suspended? If zk is 
> suspended and the heartbeat times out, execute the zk-lost related logic.
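
The proposal above can be sketched roughly as follows. This is only an illustration of the idea, not Flink code: the class `SuspendedAwareHeartbeat`, the enum `ZkState`, and all method names are hypothetical, and the 50s/20s values are simply the numbers mentioned in the description.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the proposal; not part of Flink. */
class SuspendedAwareHeartbeat {

    /** Simplified view of the zk connection state (assumed for this sketch). */
    enum ZkState { CONNECTED, SUSPENDED, LOST }

    private final Duration normalTimeout;     // e.g. heartbeat.timeout (50s)
    private final Duration suspendedTimeout;  // shorter timeout proposed for SUSPENDED (20s)
    private ZkState zkState = ZkState.CONNECTED;

    SuspendedAwareHeartbeat(Duration normalTimeout, Duration suspendedTimeout) {
        this.normalTimeout = normalTimeout;
        this.suspendedTimeout = suspendedTimeout;
    }

    /** Called whenever the zk connection state changes. */
    void onZkStateChanged(ZkState newState) {
        this.zkState = newState;
    }

    /** Heartbeat timeout in effect for the current zk state. */
    Duration effectiveTimeout() {
        if (zkState == ZkState.SUSPENDED
                && suspendedTimeout.compareTo(normalTimeout) < 0) {
            return suspendedTimeout;
        }
        return normalTimeout;
    }

    /**
     * Returns true if the zk-lost logic should run: either zk is already LOST,
     * or zk is SUSPENDED and no heartbeat arrived within the reduced timeout.
     */
    boolean shouldRunZkLostLogic(Duration sinceLastHeartbeat) {
        if (zkState == ZkState.LOST) {
            return true;
        }
        return zkState == ZkState.SUSPENDED
                && sinceLastHeartbeat.compareTo(effectiveTimeout()) >= 0;
    }

    public static void main(String[] args) {
        SuspendedAwareHeartbeat hb =
                new SuspendedAwareHeartbeat(Duration.ofSeconds(50), Duration.ofSeconds(20));
        hb.onZkStateChanged(ZkState.SUSPENDED);
        System.out.println("effective timeout: " + hb.effectiveTimeout());
        System.out.println("run zk-lost logic after 25s: "
                + hb.shouldRunZkLostLogic(Duration.ofSeconds(25)));
    }
}
```

In this sketch a TM whose zk server was merely restarted keeps receiving JM heartbeats and is unaffected, while a partitioned TM (zk SUSPENDED and no heartbeat) would fail after the shorter timeout instead of waiting for the full heartbeat or session timeout.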



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
