[ https://issues.apache.org/jira/browse/RATIS-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shilun Fan updated RATIS-2408:
------------------------------
    Description: 
*Problem*
 
Currently, the Netty DataStream client uses a fixed 100ms delay for 
reconnection attempts when the connection fails. This approach has several 
limitations:

1. Resource waste: During network issues or server unavailability, constant 
100ms retry intervals create unnecessary load
2. Thundering herd: Multiple clients reconnecting simultaneously can overwhelm 
the server
3. Lack of configurability: Users cannot tune reconnection behavior for their 
specific use cases
 
 

*Solution*

Implement configurable exponential backoff with jitter for DataStream client 
reconnections:

1. Configuration Support:
 - `raft.client.datastream.reconnect.delay` - Initial reconnection delay 
(default: 100ms)
 - `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay 
(default: 5s)

2. Exponential Backoff:
 - Delay doubles on each failed attempt until it reaches the maximum: 100ms → 
200ms → 400ms → 800ms → 1600ms → 3200ms → 5000ms (capped at max-delay)
 - Resets to initial delay upon successful connection

3. Jitter (0.5x-1.5x):
 - Randomizes actual delay to avoid synchronized reconnection storms
 - Example: 1000ms base → actual delay between 500ms-1500ms

4. Concurrent Safety:
 - Prevents duplicate reconnection scheduling using atomic flags
 - Ensures cleanup even if reconnection is short-circuited

5. Adaptive Logging:
 - INFO level for short delays (≤500ms) - normal reconnection
 - WARN level for long delays (>500ms) - persistent failures
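
The five points above can be sketched roughly as follows. This is only an 
illustration of the described behaviour, not the code in the patch: the class 
`ReconnectBackoff` and all of its method names are hypothetical, and it assumes 
that jitter is applied to the already-capped base delay (so an actual delay may 
exceed max-delay by up to 1.5x) and that the log level is chosen from the base 
delay.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper illustrating the proposal; not a Ratis class. */
final class ReconnectBackoff {
  private final long initialDelayMs;
  private final long maxDelayMs;
  private long currentDelayMs;
  // Guards against scheduling two reconnect tasks at the same time.
  private final AtomicBoolean scheduled = new AtomicBoolean(false);

  ReconnectBackoff(long initialDelayMs, long maxDelayMs) {
    if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs) {
      throw new IllegalArgumentException("require 0 < initial <= max");
    }
    this.initialDelayMs = initialDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.currentDelayMs = initialDelayMs;
  }

  /** Returns the base delay for this attempt, then doubles it up to the cap. */
  synchronized long nextBaseDelayMs() {
    final long delay = currentDelayMs;
    // Compare against max/2 so the doubling cannot overflow.
    currentDelayMs = currentDelayMs > maxDelayMs / 2 ? maxDelayMs : currentDelayMs * 2;
    return delay;
  }

  /** Scales a base delay by a random factor in [0.5, 1.5). */
  static long withJitter(long baseDelayMs) {
    final double factor = 0.5 + ThreadLocalRandom.current().nextDouble();
    return (long) (baseDelayMs * factor);
  }

  /** Call after a successful connection. */
  synchronized void reset() {
    currentDelayMs = initialDelayMs;
  }

  /** Returns true only for the caller that wins the right to schedule. */
  boolean tryMarkScheduled() {
    return scheduled.compareAndSet(false, true);
  }

  /** Call in a finally block so the flag is cleared even on a short-circuit. */
  void clearScheduled() {
    scheduled.set(false);
  }

  /** INFO for delays up to 500ms, WARN above that. */
  static String logLevelFor(long baseDelayMs) {
    return baseDelayMs <= 500 ? "INFO" : "WARN";
  }

  public static void main(String[] args) {
    final ReconnectBackoff backoff = new ReconnectBackoff(100, 5000);
    for (int attempt = 1; attempt <= 8; attempt++) {
      final long base = backoff.nextBaseDelayMs();
      System.out.println(logLevelFor(base) + " attempt " + attempt
          + ": base=" + base + "ms, actual=" + withJitter(base) + "ms");
    }
  }
}
```

With the defaults (100ms initial, 5s maximum), the base delays produced by this 
sketch are 100, 200, 400, 800, 1600, 3200, 5000, 5000, ... until `reset()` is 
called.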

  was:
*Problem*
 
Currently, the Netty DataStream client uses a fixed 100ms delay for 
reconnection attempts when the connection fails. This approach has several 
limitations:

1. *{*}Resource waste{*}*: During network issues or server unavailability, 
constant 100ms retry intervals create unnecessary load
2. *{*}Thundering herd{*}*: Multiple clients reconnecting simultaneously can 
overwhelm the server
3. *{*}Lack of configurability{*}*: Users cannot tune reconnection behavior for 
their specific use cases
 
 

*Solution*

Implement configurable exponential backoff with jitter for DataStream client 
reconnections:

1. *{*}Configuration Support{*}*:
 - `raft.client.datastream.reconnect.delay` - Initial reconnection delay 
(default: 100ms)
 - `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay 
(default: 5s)

2. *{*}Exponential Backoff{*}*:
 - Delay doubles on each failed attempt: 100ms → 200ms → 400ms → 800ms → 1600ms 
→ 5000ms
 - Resets to initial delay upon successful connection

3. *{*}Jitter (0.5x-1.5x){*}*:
 - Randomizes actual delay to avoid synchronized reconnection storms
 - Example: 1000ms base → actual delay between 500ms-1500ms

4. *{*}Concurrent Safety{*}*:
 - Prevents duplicate reconnection scheduling using atomic flags
 - Ensures cleanup even if reconnection is short-circuited

5. *{*}Adaptive Logging{*}*:
 - INFO level for short delays (≤500ms) - normal reconnection
 - WARN level for long delays (>500ms) - persistent failures


> Add configurable exponential backoff reconnection for Netty DataStream client
> -----------------------------------------------------------------------------
>
>                 Key: RATIS-2408
>                 URL: https://issues.apache.org/jira/browse/RATIS-2408
>             Project: Ratis
>          Issue Type: Improvement
>          Components: Netty
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>
> *Problem*
>  
> Currently, the Netty DataStream client uses a fixed 100ms delay for 
> reconnection attempts when the connection fails. This approach has several 
> limitations:
> 1. Resource waste: During network issues or server unavailability, constant 
> 100ms retry intervals create unnecessary load
> 2. Thundering herd: Multiple clients reconnecting simultaneously can 
> overwhelm the server
> 3. Lack of configurability: Users cannot tune reconnection behavior for their 
> specific use cases
>  
>  
> *Solution*
> Implement configurable exponential backoff with jitter for DataStream client 
> reconnections:
> 1. Configuration Support:
>  - `raft.client.datastream.reconnect.delay` - Initial reconnection delay 
> (default: 100ms)
>  - `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay 
> (default: 5s)
> 2. Exponential Backoff:
>  - Delay doubles on each failed attempt until it reaches the maximum: 100ms → 
> 200ms → 400ms → 800ms → 1600ms → 3200ms → 5000ms (capped at max-delay)
>  - Resets to initial delay upon successful connection
> 3. Jitter (0.5x-1.5x):
>  - Randomizes actual delay to avoid synchronized reconnection storms
>  - Example: 1000ms base → actual delay between 500ms-1500ms
> 4. Concurrent Safety:
>  - Prevents duplicate reconnection scheduling using atomic flags
>  - Ensures cleanup even if reconnection is short-circuited
> 5. Adaptive Logging:
>  - INFO level for short delays (≤500ms) - normal reconnection
>  - WARN level for long delays (>500ms) - persistent failures



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
