liuml07 opened a new pull request, #27104: URL: https://github.com/apache/flink/pull/27104
## What is the purpose of the change

https://issues.apache.org/jira/browse/FLINK-38499

Currently, the Curator framework used by ZK-based HA uses the exponential backoff retry policy, but the max sleep time is unbounded. When the retryCount is large, the sleep between retries can therefore grow without limit, and recovery from ZK issues may be unreasonably slow.

In my day job, we carry a critical patch that limits the max sleep time, added after seeing multiple ZK issues in the past. In other Apache projects, the BoundedExponentialBackoffRetry is widely used, such as Fluss, Druid, Hudi, BookKeeper, and Phoenix, to name a few.

This Jira proposes to limit the max sleep time by leveraging BoundedExponentialBackoffRetry, with a pretty high default value for starters. Users can change this via a new config option.

## Brief change log

1. Added a new configuration option for HA:
   - Key: `high-availability.zookeeper.client.max-retry-wait`
   - Type: Duration
   - Default: 30 seconds (30000ms)
   - Description: Caps exponential backoff to prevent excessively long waits between retries
2. Updated the retry policy in `ZooKeeperUtils`
3. Updated test files to use the new retry policy

## Verifying this change

Updated existing tests. Ported from an internally tested patch.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (**yes** / no / don't know)
- The S3 file system connector: (yes / **no** / don't know)

## Documentation

- Does this pull request introduce a new feature? (**yes** / no)
- If yes, how is the feature documented?
(not applicable / docs / **JavaDocs** / not documented)
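
To illustrate why capping the wait matters, here is a minimal, standalone sketch of the worst-case sleep time with and without a bound. It is not the code in this PR and it does not call Curator; the class name, the method names, and the 5-second base wait are assumptions chosen for illustration, and the randomization that a real retry policy applies is left out so the numbers are deterministic. Only the 30-second cap mirrors the default proposed for `high-availability.zookeeper.client.max-retry-wait`.

```java
import java.time.Duration;

// Hypothetical illustration only: a simplified model of exponential backoff,
// not the Curator implementation and not the change made in ZooKeeperUtils.
class BoundedBackoffSketch {

    /** Worst-case wait without a cap, modeled as base * 2^(retryCount + 1). */
    static long unboundedWaitMs(long baseSleepMs, int retryCount) {
        // Clamp the shift so this illustration cannot overflow a long.
        int shift = Math.min(retryCount + 1, 30);
        return baseSleepMs * (1L << shift);
    }

    /** Same worst-case wait, but never longer than maxRetryWait. */
    static long boundedWaitMs(long baseSleepMs, int retryCount, Duration maxRetryWait) {
        return Math.min(maxRetryWait.toMillis(), unboundedWaitMs(baseSleepMs, retryCount));
    }

    public static void main(String[] args) {
        long baseSleepMs = 5_000L;                     // assumed base wait
        Duration maxRetryWait = Duration.ofSeconds(30); // default proposed in this PR

        for (int retry = 0; retry <= 10; retry += 2) {
            System.out.printf(
                    "retry=%d unbounded=%d ms bounded=%d ms%n",
                    retry,
                    unboundedWaitMs(baseSleepMs, retry),
                    boundedWaitMs(baseSleepMs, retry, maxRetryWait));
        }
    }
}
```

Under these assumptions the uncapped worst case passes 30 seconds by the third attempt and keeps doubling, while the bounded variant stays at the configured maximum, so a client that has already retried many times still reconnects within a predictable window once ZK is healthy again.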
