Said BOUDJELDA created KAFKA-20113:
--------------------------------------
Summary: Add Configurable Retry Parameters for Status Backing Store
Key: KAFKA-20113
URL: https://issues.apache.org/jira/browse/KAFKA-20113
Project: Kafka
Issue Type: New Feature
Components: connect
Reporter: Said BOUDJELDA
Assignee: Said BOUDJELDA
Fix For: 4.2.0
Implement configurable retry parameters for the +KafkaStatusBackingStore+
to address the TODO comment "retry more gracefully and not forever" and provide
operators with control over retry behavior during transient failures.
h3. Problem Statement
KafkaStatusBackingStore currently retries status updates indefinitely when
encountering retriable exceptions. This behavior is problematic because:
# *Infinite retry loops* can cause the worker to become unresponsive during
extended Kafka broker outages
# *No control or visibility* over retry behavior - operators cannot tune retry
parameters based on their environment
# *Resource exhaustion* - indefinite retries can consume threads and memory
during prolonged failures
# *No graceful degradation* - the system continues retrying without bound
rather than failing fast when appropriate
A TODO comment in the codebase
({{// TODO: retry more gracefully and not forever}})
explicitly acknowledges that this issue needs addressing.
h3. Proposed Solution
Add four new configuration properties under the {{status.storage.}} prefix to
control retry behavior:
||Property||Type||Default||Description||
|{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts before giving up|
|{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay in milliseconds|
|{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap in milliseconds|
|{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to backoff after each attempt|
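As a rough illustration of how the four properties and their defaults could be read, here is a minimal Java sketch. The class {{StatusRetrySettings}} and its validation rules are hypothetical and are not part of Kafka Connect; only the property names and default values come from the table above.

```java
import java.util.Properties;

/** Hypothetical holder for the four proposed retry settings (not Kafka Connect code). */
final class StatusRetrySettings {
    final int maxRetries;
    final long initialBackoffMs;
    final long maxBackoffMs;
    final double backoffMultiplier;

    private StatusRetrySettings(int maxRetries, long initialBackoffMs,
                                long maxBackoffMs, double backoffMultiplier) {
        this.maxRetries = maxRetries;
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.backoffMultiplier = backoffMultiplier;
    }

    /** Reads the proposed keys, falling back to the defaults listed in the table. */
    static StatusRetrySettings fromProperties(Properties props) {
        int maxRetries = Integer.parseInt(
                props.getProperty("status.storage.retry.max.retries", "5"));
        long initialBackoffMs = Long.parseLong(
                props.getProperty("status.storage.retry.initial.backoff.ms", "300"));
        long maxBackoffMs = Long.parseLong(
                props.getProperty("status.storage.retry.max.backoff.ms", "10000"));
        double multiplier = Double.parseDouble(
                props.getProperty("status.storage.retry.backoff.multiplier", "2.0"));

        // Assumed sanity checks; the proposal does not specify validation rules.
        if (maxRetries < 0 || initialBackoffMs < 0
                || maxBackoffMs < initialBackoffMs || multiplier < 1.0) {
            throw new IllegalArgumentException("Invalid status.storage.retry.* settings");
        }
        return new StatusRetrySettings(maxRetries, initialBackoffMs, maxBackoffMs, multiplier);
    }
}
```

Calling {{StatusRetrySettings.fromProperties(new Properties())}} yields the defaults, so a worker configuration that omits these keys would keep working under this sketch.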
The retry mechanism uses *exponential backoff with jitter* to prevent
thundering herd problems during cluster recovery.
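The delay calculation could look roughly like the sketch below. It assumes a 0-based attempt index and "full jitter" (a uniformly random delay between 0 and the capped exponential value); the issue does not say which jitter strategy is intended, and {{BackoffSketch}} is a hypothetical name, not an existing Kafka class.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of exponential backoff with jitter (not Kafka Connect code). */
final class BackoffSketch {

    /**
     * Capped exponential delay for a 0-based attempt index:
     * min(initialMs * multiplier^attempt, maxMs).
     */
    static long cappedBackoffMs(int attempt, long initialMs, long maxMs, double multiplier) {
        double raw = initialMs * Math.pow(multiplier, attempt);
        // Comparing as doubles keeps very large attempts (raw == Infinity) at the cap.
        return (long) Math.min(raw, (double) maxMs);
    }

    /**
     * Full jitter (an assumption): pick a uniform delay in [0, cap] so that many
     * workers recovering at the same time do not retry in lockstep.
     * Assumes maxMs is well below Long.MAX_VALUE.
     */
    static long jitteredBackoffMs(int attempt, long initialMs, long maxMs, double multiplier) {
        long cap = cappedBackoffMs(attempt, initialMs, maxMs, multiplier);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }
}
```

With the proposed defaults (300 ms initial, multiplier 2.0, 10000 ms cap), the capped values for attempts 0 through 6 are 300, 600, 1200, 2400, 4800, 9600 and 10000 ms; the jittered delay is drawn at or below those values.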
h4. Behavior
* Retries occur only for exceptions marked as {{RetriableException}}
* After exhausting {{status.storage.retry.max.retries}}, the operation logs an
error and terminates gracefully
* All retry attempts are logged at WARN level with attempt count and delay
information
* Non-retriable exceptions fail immediately without retry
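The behavior listed above can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not the actual {{KafkaStatusBackingStore}} change: {{BoundedRetrySketch}} and {{runWithRetries}} are hypothetical names, the nested {{RetriableException}} is a stand-in for Kafka's {{org.apache.kafka.common.errors.RetriableException}} so the example compiles without the Kafka client library, {{System.err}} stands in for the WARN and ERROR log calls, and the maximum is interpreted as the number of retries after the initial attempt.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of the bounded retry behavior (not Kafka Connect code). */
final class BoundedRetrySketch {

    /** Stand-in for org.apache.kafka.common.errors.RetriableException. */
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying only on RetriableException.
     * Returns true on success and false once maxRetries retries are exhausted
     * (or the thread is interrupted while backing off).
     * Any other exception propagates immediately without a retry.
     */
    static boolean runWithRetries(Runnable op, int maxRetries, long initialMs,
                                  long maxMs, double multiplier) {
        for (int attempt = 0; ; attempt++) {
            try {
                op.run();
                return true;
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    // Stand-in for an ERROR-level log entry before giving up.
                    System.err.printf("ERROR giving up after %d retries: %s%n",
                            maxRetries, e.getMessage());
                    return false;
                }
                // Capped exponential backoff with full jitter (jitter style is assumed).
                long cap = (long) Math.min(initialMs * Math.pow(multiplier, attempt),
                        (double) maxMs);
                long delayMs = ThreadLocalRandom.current().nextLong(cap + 1);
                // Stand-in for a WARN-level log entry with attempt count and delay.
                System.err.printf("WARN retry %d/%d in %d ms: %s%n",
                        attempt + 1, maxRetries, delayMs, e.getMessage());
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
    }
}
```

Under this interpretation, a maximum of 5 retries means at most 6 calls to the operation in total. Whether the real implementation counts the initial attempt toward the limit is not stated in this issue.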
h3. Benefits
# *Predictable failure modes* - Workers eventually give up and surface errors
instead of hanging
# *Operator control* - Tune retry behavior based on environment characteristics
# *Better observability* - Clear logging of retry attempts and outcomes
# *Backward compatible* - Default values keep behavior close to the current
implementation for short transient failures; only prolonged failures now stop
retrying