The GitHub Actions job "CI" on 
kvrocks.git/fix_replication_falling_behind_and_freeze has failed.
Run started by GitHub user ethervoid (triggered by git-hulk).

Head commit for run:
c392ce711f1cf57938c2f5ce1eb8ec6cb3e926cb / Mario de Frutos <[email protected]>
fix(replication): prevent WAL exhaustion from slow consumers

The replication feed thread could block indefinitely when sending data
to a slow replica. If the replica wasn't consuming data fast enough,
the TCP send buffer would fill and the feed thread would block on
write() with no timeout. During this time, WAL files would rotate and
be pruned, leaving the replica's sequence unavailable when the thread
eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

1. Socket send timeout: New SockSendWithTimeout() function that uses
   poll() to wait for socket writability with a configurable timeout
   (default 30 seconds). This prevents indefinite blocking.

2. Replication lag detection: At the start of each loop iteration,
   check if the replica has fallen too far behind (configurable via
   max-replication-lag). If exceeded, disconnect the slow consumer
   before WAL is exhausted, allowing psync on reconnect.
   Disabled by default (0); set to a positive value to enable.

3. Exponential backoff on reconnection: When a replica is disconnected,
   it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before
   reconnecting. This prevents rapid reconnection loops for persistently
   slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting
  (default: 0 = disabled)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)
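
As an illustration, the two options might be set in kvrocks.conf as shown
below. This assumes the usual "key value" line format of that file; the
lag threshold of 100000 is an arbitrary example value, not a
recommendation from the commit.

```
# Disconnect a replica once it falls more than 100000 sequences behind
# (example value; 0 keeps the check disabled)
max-replication-lag 100000

# Give up on a blocked send to a replica after 30 seconds (the default)
replication-send-timeout-ms 30000
```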

Fixes https://github.com/apache/kvrocks/issues/3356

Report URL: https://github.com/apache/kvrocks/actions/runs/21588736336

With regards,
GitHub Actions via GitBox
