Mingliang Liu created FLINK-38499:
-------------------------------------

             Summary: Limit max sleep time in Curator for Zookeeper HA
                 Key: FLINK-38499
                 URL: https://issues.apache.org/jira/browse/FLINK-38499
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Mingliang Liu


Currently, the Curator framework used by ZK-based HA uses the exponential 
backoff retry policy. However, the max sleep time is unbounded, so a single 
sleep can grow very long when the retryCount is large. When that happens, 
recovery from ZK issues may be unreasonably slow.

At my day job, we carry a critical patch that limits the max sleep time, added 
after seeing multiple ZK issues in the past. In other Apache projects, 
{{BoundedExponentialBackoffRetry}} is widely used, such as Fluss, Druid, Hudi, 
BookKeeper, and Phoenix, to name a few.

This Jira proposes to limit the max sleep time by leveraging 
{{BoundedExponentialBackoffRetry}}, with a pretty high default value for 
starters. Users can change this via a new config option.
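
To illustrate the effect of the cap, here is a minimal, dependency-free sketch. It is a simplified model, not Curator's or Flink's actual code: the class and method names are hypothetical, the random jitter is replaced by its assumed worst case (a multiplier of up to 2^(retryCount+1) - 1), and the 1s base / 30s cap values are made up for the example. In Curator itself the equivalent cap would come from constructing {{BoundedExponentialBackoffRetry}} with a base sleep time, a max sleep time, and a max retry count.

```java
public final class BackoffSketch {

    // Assumed worst-case sleep of an exponential backoff whose jitter
    // multiplier is drawn from [1, 2^(retryCount + 1) - 1].
    // Only meant for small retry counts (no overflow handling).
    public static long unboundedWorstCaseSleepMs(long baseSleepTimeMs, int retryCount) {
        long maxMultiplier = (1L << (retryCount + 1)) - 1;
        return baseSleepTimeMs * maxMultiplier;
    }

    // Same backoff, but each sleep is clamped to maxSleepTimeMs.
    public static long boundedSleepMs(long baseSleepTimeMs, long maxSleepTimeMs, int retryCount) {
        return Math.min(maxSleepTimeMs, unboundedWorstCaseSleepMs(baseSleepTimeMs, retryCount));
    }

    public static void main(String[] args) {
        long baseSleepTimeMs = 1_000;  // hypothetical base sleep
        long maxSleepTimeMs = 30_000;  // hypothetical cap, not the proposed default
        for (int retryCount = 0; retryCount <= 10; retryCount++) {
            System.out.printf("retry=%d unbounded<=%dms bounded<=%dms%n",
                    retryCount,
                    unboundedWorstCaseSleepMs(baseSleepTimeMs, retryCount),
                    boundedSleepMs(baseSleepTimeMs, maxSleepTimeMs, retryCount));
        }
    }
}
```

Under these assumptions, the uncapped worst case roughly doubles with every retry, while the capped variant stops growing once it reaches the configured maximum.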



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
