ptlrs opened a new pull request, #10482:
URL: https://github.com/apache/ozone/pull/10482
## What changes were proposed in this pull request?
Some Ratis client timeout and retry settings are so long that the client
appears non-responsive when the server side fails.
This change shortens them so that the client responds faster to
server-side failures:
- Exponential Backoff (for `TimeoutIOException` on writes):
  - Base sleep: 4s to 1s, for a faster initial retry.
  - Max sleep cap: 40s to 5s, so the client does not wait long between retries on a dead leader.
  - Max retries: unlimited (`Integer.MAX_VALUE`) to 2. If 2 retries fail, the leader is treated as dead and Ozone allocates a new pipeline.
- Multilinear Random Retry (for generic/other exceptions):
  - Policy: 5s×5, 10s×5, 15s×5, 20s×5, 25s×5, 60s×10 (~16 min, 35 retries) to 5s×6 (~30s total), to fail fast instead of hanging.
- Watch Timeout (waiting for ALL_COMMITTED replication):
  - Overall watch timeout: 3 min to 30s.
  - Watch RPC timeout: 180s to 30s, aligned with the server-side watch timeout (also 30s). There is no point waiting longer than the server.
- Write Timeout (overall budget for write retries):
  - Write request timeout: 5 min to 70s, enough for one RPC (60s) plus a buffer. This prevents retries from dragging on.
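
The retry totals quoted above can be checked with a little arithmetic. The following is a minimal sketch, not code from this patch: `RetryBudget` and its methods are hypothetical helpers, and it assumes nominal sleep times, that is, plain doubling up to the cap for the exponential policy and no randomization in either policy. The actual Ratis policies may add jitter, so treat the results as rough estimates.

```java
/**
 * Hypothetical helper (not part of Ozone or Ratis) that sums the nominal
 * sleep time of the retry policies described in this pull request.
 */
final class RetryBudget {
  private RetryBudget() { }

  /** Total nominal sleep for pairs of {sleepSeconds, retryCount}, e.g. {5, 6} for "5s×6". */
  public static long multilinearTotalSeconds(long[][] pairs) {
    long total = 0;
    for (long[] pair : pairs) {
      total += pair[0] * pair[1];
    }
    return total;
  }

  /** Total number of retries across all {sleepSeconds, retryCount} pairs. */
  public static long multilinearRetryCount(long[][] pairs) {
    long count = 0;
    for (long[] pair : pairs) {
      count += pair[1];
    }
    return count;
  }

  /**
   * Total nominal sleep for an exponential backoff, assuming the sleep starts
   * at baseSeconds and doubles after every retry, capped at maxSeconds.
   */
  public static long exponentialTotalSeconds(long baseSeconds, long maxSeconds, int retries) {
    long total = 0;
    long sleep = Math.min(baseSeconds, maxSeconds);
    for (int i = 0; i < retries; i++) {
      total += sleep;
      sleep = Math.min(sleep * 2, maxSeconds);
    }
    return total;
  }

  public static void main(String[] args) {
    long[][] oldPolicy = {{5, 5}, {10, 5}, {15, 5}, {20, 5}, {25, 5}, {60, 10}};
    long[][] newPolicy = {{5, 6}};

    // Old multilinear policy: 975 s (about 16 min) over 35 retries.
    System.out.println(multilinearTotalSeconds(oldPolicy) + " s, "
        + multilinearRetryCount(oldPolicy) + " retries");
    // New multilinear policy: 30 s over 6 retries.
    System.out.println(multilinearTotalSeconds(newPolicy) + " s, "
        + multilinearRetryCount(newPolicy) + " retries");
    // New exponential backoff (base 1 s, cap 5 s, 2 retries): 1 s + 2 s = 3 s.
    System.out.println(exponentialTotalSeconds(1, 5, 2) + " s");
  }
}
```

Under these assumptions the old multilinear policy sleeps for 975s (about 16 min) across 35 retries, while the new one sleeps for 30s across 6 retries. The new exponential backoff adds only about 3s of sleep before the client gives up on the leader.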
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15444
## How was this patch tested?
CI: https://github.com/ptlrs/ozone/actions/runs/27291093969