[ 
https://issues.apache.org/jira/browse/KAFKA-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Said BOUDJELDA updated KAFKA-20113:
-----------------------------------
    Description: 
Implement configurable retry parameters for {{KafkaStatusBackingStore}} to 
address the TODO comment "retry more gracefully and not forever" and provide 
operators with control over retry behavior during transient failures.
h3. Problem Statement

 
{{KafkaStatusBackingStore}} currently retries status updates indefinitely when 
encountering retriable exceptions. This behavior is problematic because:
 # *Infinite retry loops* can cause the worker to become unresponsive during 
extended Kafka broker outages
 # *No visibility* into retry behavior - operators cannot tune retry parameters 
based on their environment
 # *Resource exhaustion* - indefinite retries can consume threads and memory 
during prolonged failures
 # *No graceful degradation* - the system continues retrying without bound 
rather than failing fast when appropriate

A TODO comment in the codebase ({{// TODO: retry more gracefully and not 
forever}}) explicitly acknowledges this issue needs addressing.
h3. Proposed Solution

Add four new configuration properties under the {{status.storage.}} prefix to 
control retry behavior:

 
||Property||Type||Default||Description||
|{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts before giving up|
|{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay in milliseconds|
|{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap in milliseconds|
|{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to backoff after each attempt|
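
For illustration only, these properties could be declared along the following lines (the property names and defaults come from the table above; the {{Importance}} levels, documentation strings, and surrounding class are assumptions, not the actual patch):
{code:java}
import org.apache.kafka.common.config.ConfigDef;

/** Sketch only: how the proposed properties might be declared (importance levels assumed). */
final class StatusStoreRetryConfigSketch {

    static ConfigDef withRetryConfigs(ConfigDef base) {
        return base
                .define("status.storage.retry.max.retries", ConfigDef.Type.INT, 5,
                        ConfigDef.Importance.LOW, "Maximum number of retry attempts before giving up")
                .define("status.storage.retry.initial.backoff.ms", ConfigDef.Type.LONG, 300L,
                        ConfigDef.Importance.LOW, "Initial backoff delay in milliseconds")
                .define("status.storage.retry.max.backoff.ms", ConfigDef.Type.LONG, 10000L,
                        ConfigDef.Importance.LOW, "Maximum backoff delay cap in milliseconds")
                .define("status.storage.retry.backoff.multiplier", ConfigDef.Type.DOUBLE, 2.0,
                        ConfigDef.Importance.LOW, "Multiplier applied to backoff after each attempt");
    }
}
{code}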

The retry mechanism uses *exponential backoff with jitter* to prevent 
thundering herd problems during cluster recovery.
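
A minimal sketch of that calculation, assuming a "full jitter" strategy (the method and variable names are illustrative; the actual implementation may differ):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

/** Sketch only: exponential backoff with "full" jitter. */
final class BackoffSketch {

    static long backoffMs(int attempt, long initialBackoffMs, long maxBackoffMs, double multiplier) {
        // Exponential growth: initial * multiplier^attempt, capped at the configured maximum.
        long capped = (long) Math.min(initialBackoffMs * Math.pow(multiplier, attempt), (double) maxBackoffMs);
        // Full jitter: draw a uniformly random delay in [0, capped] so recovering workers
        // do not all retry at the same instant.
        return ThreadLocalRandom.current().nextLong(capped + 1);
    }

    public static void main(String[] args) {
        // Using the defaults from the table above: 300 ms initial, 10 s cap, 2.0 multiplier.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.printf("attempt %d -> sleep %d ms%n",
                    attempt, backoffMs(attempt, 300L, 10_000L, 2.0));
        }
    }
}
{code}
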
h4. Behavior
 * Retries occur only for exceptions marked as {{RetriableException}}
 * After exhausting {{status.storage.retry.max.retries}} attempts, the operation logs an error and terminates gracefully
 * All retry attempts are logged at WARN level with the attempt count and delay information
 * Non-retriable exceptions fail immediately without retry (a sketch of this loop follows the list)
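
A minimal sketch of the loop described above, assuming the four configuration values have already been parsed (class and method names are illustrative; only {{RetriableException}}, the log levels, and the bounded-retry behavior come from this description):
{code:java}
import org.apache.kafka.common.errors.RetriableException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch only: bounded retry loop matching the behavior bullets above. */
final class StatusUpdateRetrySketch {
    private static final Logger log = LoggerFactory.getLogger(StatusUpdateRetrySketch.class);

    static void runWithRetries(Runnable statusUpdate, int maxRetries,
                               long initialBackoffMs, long maxBackoffMs, double multiplier)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                statusUpdate.run();
                return;                                   // success: stop retrying
            } catch (RetriableException e) {
                if (attempt == maxRetries) {
                    // Retries exhausted: log an error and terminate instead of looping forever.
                    log.error("Giving up status update after {} retries", maxRetries, e);
                    return;
                }
                // Backoff calculation as in the jitter sketch above.
                double exp = initialBackoffMs * Math.pow(multiplier, attempt);
                long delay = java.util.concurrent.ThreadLocalRandom.current()
                        .nextLong((long) Math.min(exp, (double) maxBackoffMs) + 1);
                // Each retry is logged at WARN with the attempt count and chosen delay.
                log.warn("Retriable failure on attempt {}/{}; retrying in {} ms",
                        attempt + 1, maxRetries, delay, e);
                Thread.sleep(delay);
            }
            // Non-retriable exceptions are not caught here, so they fail immediately.
        }
    }
}
{code}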

h3. Benefits
 # *Predictable failure modes* - Workers eventually give up and surface errors 
instead of hanging
 # *Operator control* - Tune retry behavior based on environment characteristics
 # *Better observability* - Clear logging of retry attempts and outcomes
 # *Backward compatible* - Default values maintain similar behavior to current 
implementation

 

 


> Add Configurable Retry Parameters for Status Backing Store
> ----------------------------------------------------------
>
>                 Key: KAFKA-20113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20113
>             Project: Kafka
>          Issue Type: New Feature
>          Components: connect
>            Reporter: Said BOUDJELDA
>            Assignee: Said BOUDJELDA
>            Priority: Major
>              Labels: configuration, connect, improvement, reliability
>             Fix For: 4.2.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
