Mingliang Liu created FLINK-38499:
-------------------------------------
Summary: Limit max sleep time in Curator for Zookeeper HA
Key: FLINK-38499
URL: https://issues.apache.org/jira/browse/FLINK-38499
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Mingliang Liu
Currently, the Curator framework used by ZK-based HA uses the exponential
backoff retry policy. However, the max sleep time is unbounded, so the sleep
between retries can grow very large when the retryCount is large. When that
happens, recovery from ZK issues may be unreasonably slow.
In my day job, we carry a critical patch that limits the max sleep time, added
after seeing multiple ZK issues in the past. In other Apache projects,
{{BoundedExponentialBackoffRetry}} is widely used, such as Fluss, Druid, Hudi,
BookKeeper, and Phoenix, to name a few.
This Jira proposes to limit the max sleep time by leveraging
{{BoundedExponentialBackoffRetry}}, with a fairly high default value to start
with. Users can change this via a new config option.
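
To illustrate the difference a cap makes, here is a minimal stand-alone sketch
in plain Java. It is a simplified model of randomized exponential backoff and
not Curator's actual code: the class name, the method names, and the 1 s base
and 60 s cap are all hypothetical values chosen for illustration, and the new
Flink config option is not modeled. In Curator itself, the bounded policy would
be constructed along the lines of
new BoundedExponentialBackoffRetry(baseSleepTimeMs, maxSleepTimeMs, maxRetries).

```java
import java.util.Random;

/** Simplified model of exponential backoff with and without a cap (not Curator's code). */
class BoundedBackoffSketch {

    /** Largest sleep the uncapped model can produce: base * (2^(retryCount + 1) - 1). */
    static long worstCaseSleepMs(long baseSleepMs, int retryCount) {
        return baseSleepMs * ((1L << (retryCount + 1)) - 1);
    }

    /**
     * Uncapped randomized exponential sleep. Assumes 0 <= retryCount <= 29 so
     * that the int shift below does not overflow.
     */
    static long exponentialSleepMs(long baseSleepMs, int retryCount, Random random) {
        int multiplier = Math.max(1, random.nextInt(1 << (retryCount + 1)));
        return baseSleepMs * multiplier;
    }

    /** Same sleep, but never longer than maxSleepMs. */
    static long boundedSleepMs(long baseSleepMs, long maxSleepMs, int retryCount, Random random) {
        return Math.min(maxSleepMs, exponentialSleepMs(baseSleepMs, retryCount, random));
    }

    public static void main(String[] args) {
        long baseSleepMs = 1_000;  // hypothetical base sleep
        long maxSleepMs = 60_000;  // hypothetical cap
        Random random = new Random();
        for (int retry = 0; retry <= 20; retry += 5) {
            System.out.printf(
                "retry=%d worstCaseUncapped=%d ms capped=%d ms%n",
                retry,
                worstCaseSleepMs(baseSleepMs, retry),
                boundedSleepMs(baseSleepMs, maxSleepMs, retry, random));
        }
    }
}
```

Under these assumed values, the uncapped worst case at retry 20 is
2,097,151,000 ms (roughly 24 days), while the capped sleep never exceeds 60 s.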
--
This message was sent by Atlassian Jira
(v8.20.10#820010)