[
https://issues.apache.org/jira/browse/KAFKA-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Said BOUDJELDA updated KAFKA-20113:
-----------------------------------
Description:
Implement configurable retry parameters for the +KafkaStatusBackingStore+ to
address the TODO comment "retry more gracefully and not forever" and provide
operators with control over retry behavior during transient failures.
h3. Problem Statement
KafkaStatusBackingStore currently retries status updates indefinitely when
encountering retriable exceptions. This behavior is problematic because:
# *Infinite retry loops* can cause the worker to become unresponsive during
extended Kafka broker outages
# *No visibility* into retry behavior - operators cannot tune retry parameters
based on their environment
# *Resource exhaustion* - indefinite retries can consume threads and memory
during prolonged failures
# *No graceful degradation* - the system continues retrying without bound
rather than failing fast when appropriate
A TODO comment in the codebase ({{// TODO: retry more gracefully and not
forever}}) explicitly acknowledges this issue needs addressing.
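For context, the behavior being replaced amounts to the pattern sketched below (a simplified illustration of unbounded retry, not the actual KafkaStatusBackingStore code):
{code:java}
import org.apache.kafka.common.errors.RetriableException;

// Simplified illustration of the current unbounded pattern; not the actual store code.
// "sendStatusRecord" is a hypothetical stand-in for the status write that may fail transiently.
class UnboundedRetryIllustration {
    void writeStatus(Runnable sendStatusRecord) {
        while (true) {
            try {
                sendStatusRecord.run();
                return;
            } catch (RetriableException e) {
                // Retries forever: no attempt limit, no backoff cap, no operator control.
            }
        }
    }
}
{code}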
h3. Proposed Solution
Add four new configuration properties under the {{status.storage.}} prefix to
control retry behavior (an example worker override follows the table):
||Property||Type||Default||Description||
|{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts before giving up|
|{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay in milliseconds|
|{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap in milliseconds|
|{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to backoff after each attempt|
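As a usage example, an operator could override the defaults in the worker configuration; the property names come from the table above and the values below are arbitrary:
{noformat}
# Hypothetical worker configuration overrides (example values only)
status.storage.retry.max.retries=10
status.storage.retry.initial.backoff.ms=500
status.storage.retry.max.backoff.ms=30000
status.storage.retry.backoff.multiplier=1.5
{noformat}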
The retry mechanism uses *exponential backoff with jitter* to prevent
thundering herd problems during cluster recovery.
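A minimal sketch of that backoff calculation, assuming full jitter (the class and method names are illustrative only, not the final API):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of exponential backoff with full jitter; names are illustrative only.
public final class StatusRetryBackoff {
    private final long initialBackoffMs;  // status.storage.retry.initial.backoff.ms (default 300)
    private final long maxBackoffMs;      // status.storage.retry.max.backoff.ms (default 10000)
    private final double multiplier;      // status.storage.retry.backoff.multiplier (default 2.0)

    public StatusRetryBackoff(long initialBackoffMs, long maxBackoffMs, double multiplier) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.multiplier = multiplier;
    }

    // Delay before retry attempt N (1-based): cap initial * multiplier^(N-1) at the maximum,
    // then pick a uniformly random value in [0, cap] so recovering workers do not retry in lockstep.
    public long nextBackoffMs(int attempt) {
        double exponential = initialBackoffMs * Math.pow(multiplier, attempt - 1);
        long cappedMs = (long) Math.min(exponential, maxBackoffMs);
        return ThreadLocalRandom.current().nextLong(cappedMs + 1);
    }
}
{code}
With the defaults, the pre-jitter delays across five attempts would be 300, 600, 1200, 2400, and 4800 ms, each capped at 10000 ms and then reduced by the random jitter.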
h4. Behavior
* Retries occur only for exceptions marked as {{RetriableException}}
* After exhausting {{max.retries}}, the operation logs an error and terminates gracefully
* All retry attempts are logged at WARN level with attempt count and delay
information
* Non-retriable exceptions fail immediately without retry (see the sketch below)
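A minimal sketch of this bounded behavior (illustrative only, not the actual KafkaStatusBackingStore change; the store writes status records asynchronously, so a real implementation would more likely reschedule from the send callback than sleep on a thread):
{code:java}
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.common.errors.RetriableException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: bounded retries for retriable failures, as described in the Behavior section.
public class BoundedStatusRetrySketch {
    private static final Logger log = LoggerFactory.getLogger(BoundedStatusRetrySketch.class);

    private final int maxRetries;         // status.storage.retry.max.retries (default 5)
    private final long initialBackoffMs;  // status.storage.retry.initial.backoff.ms
    private final long maxBackoffMs;      // status.storage.retry.max.backoff.ms
    private final double multiplier;      // status.storage.retry.backoff.multiplier

    public BoundedStatusRetrySketch(int maxRetries, long initialBackoffMs, long maxBackoffMs, double multiplier) {
        this.maxRetries = maxRetries;
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.multiplier = multiplier;
    }

    // "sendStatusRecord" is a hypothetical stand-in for the write that may throw RetriableException.
    void writeWithBoundedRetries(Runnable sendStatusRecord) throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                sendStatusRecord.run();
                return;
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    // Retry budget exhausted: log and terminate gracefully instead of retrying forever.
                    log.error("Failed to write status update after {} retries; giving up", maxRetries, e);
                    return;
                }
                long cappedMs = (long) Math.min(initialBackoffMs * Math.pow(multiplier, attempt), maxBackoffMs);
                long delayMs = ThreadLocalRandom.current().nextLong(cappedMs + 1); // jitter
                log.warn("Retriable failure writing status update (attempt {}/{}); retrying in {} ms",
                        attempt + 1, maxRetries, delayMs, e);
                Thread.sleep(delayMs);
            }
            // Any non-retriable exception propagates immediately, with no retry.
        }
    }
}
{code}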
h3. Benefits
# *Predictable failure modes* - Workers eventually give up and surface errors
instead of hanging
# *Operator control* - Tune retry behavior based on environment characteristics
# *Better observability* - Clear logging of retry attempts and outcomes
# *Backward compatible* - Default values maintain behavior similar to the current
implementation
> Add Configurable Retry Parameters for Status Backing Store
> ----------------------------------------------------------
>
> Key: KAFKA-20113
> URL: https://issues.apache.org/jira/browse/KAFKA-20113
> Project: Kafka
> Issue Type: New Feature
> Components: connect
> Reporter: Said BOUDJELDA
> Assignee: Said BOUDJELDA
> Priority: Major
> Labels: configuration, connect, improvement, reliability
> Fix For: 4.2.0
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)