The GitHub Actions job "CI" on kvrocks.git/fix_replication_falling_behind_and_freeze has failed. Run started by GitHub user ethervoid (triggered by git-hulk).
Head commit for run:
696eaf789f7feef68425c81e53d5e1bc878805fd / Mario de Frutos <[email protected]>
fix(replication): prevent WAL exhaustion from slow consumers

The replication feed thread could block indefinitely when sending data to a
slow replica. If the replica wasn't consuming data fast enough, the TCP send
buffer would fill and the feed thread would block on write() with no timeout.
During this time, WAL files would rotate and be pruned, leaving the replica's
sequence unavailable when the thread eventually unblocked or the connection
dropped.

This commit adds three mechanisms to address the issue:

1. Socket send timeout: New SockSendWithTimeout() function that uses poll()
   to wait for socket writability with a configurable timeout (default 30
   seconds). This prevents indefinite blocking.

2. Replication lag detection: At the start of each loop iteration, check if
   the replica has fallen too far behind (configurable via
   max-replication-lag, default 100M sequences). If exceeded, disconnect the
   slow consumer before WAL is exhausted, allowing psync on reconnect.

3. Exponential backoff on reconnection: When a replica is disconnected, it
   now waits with exponential backoff (1s, 2s, 4s... up to 60s) before
   reconnecting. This prevents rapid reconnection loops for persistently
   slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting (default: 100M)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes https://github.com/apache/kvrocks/issues/3356

Report URL: https://github.com/apache/kvrocks/actions/runs/21531645120

With regards,
GitHub Actions via GitBox
