ethervoid opened a new issue, #3356:
URL: https://github.com/apache/kvrocks/issues/3356

   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/kvrocks/issues) and found no similar issues.
   
   
   ### Version
   
   - Master: Kvrocks 2.10.1
   - Replica: Kvrocks 2.14.0
   - OS: Linux 6.12.53-69.119.amzn2023.aarch64 (Amazon Linux 2023)
   
   ### Minimal reproduce step
   
   1. Set up a master-replica configuration with:
      - High write rate (~7K ops/sec)
      - Small WAL retention: `rocksdb.max_total_wal_size 1024` (1GB)

   2. Introduce network congestion or slowness on the replica side so that it consumes replication data more slowly than the master produces it

   3. The TCP send buffer on the master fills up, and the replication feed thread blocks on write() indefinitely

   4. Wait for WAL rotation to prune old WAL files while the feed thread is still blocked

   5. When the connection eventually drops, or the thread unblocks, observe:
      - Master logs: "Fatal error encountered, WAL iterator is discrete, some seq might be lost"
      - Replica attempts psync, fails with "sequence out of range"
      - Full resync is triggered

   The issue is that step 3 can last indefinitely (we observed 44 hours) with no timeout, errors, or warnings logged.
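
   For context, a rough estimate of the WAL retention window under these settings (the ~150-byte average WAL record size below is an assumption for illustration, not a measurement from our workload):

```cpp
// Back-of-the-envelope: how long 1 GB of WAL lasts at ~7K ops/sec.
#include <cstdio>

int main() {
  const double ops_per_sec = 7000.0;                     // observed write rate
  const double avg_wal_record_bytes = 150.0;             // assumed, for illustration only
  const double wal_budget_bytes = 1024.0 * 1024 * 1024;  // rocksdb.max_total_wal_size = 1GB
  const double seconds = wal_budget_bytes / (ops_per_sec * avg_wal_record_bytes);
  std::printf("WAL retention window: ~%.0f minutes\n", seconds / 60.0);  // prints ~17 minutes
  return 0;
}
```

   So any stall measured in hours is essentially guaranteed to outlive the retained WAL with this configuration.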
   
   ### What did you expect to see?
   
   1. The master should detect when a replica falls too far behind and proactively disconnect it before the WAL is exhausted (a rough sketch of such a check follows this list)

   2. Socket sends to replicas should have a timeout to prevent indefinite blocking

   3. Warning logs when replication lag grows significantly

   4. When disconnected early (while the sequence is still in the WAL), the replica should be able to psync successfully on reconnect instead of requiring a full resync
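
   The check in item 1 could be as simple as comparing sequence numbers. A minimal sketch, assuming hypothetical names (`ReplicaTooFarBehind`, `max_lag`) that do not exist in kvrocks today:

```cpp
#include <cstdint>

// Hypothetical helper: both arguments are monotonically increasing RocksDB WAL
// sequence numbers; max_lag would come from a (proposed) config option.
bool ReplicaTooFarBehind(uint64_t master_latest_seq, uint64_t replica_next_seq,
                         uint64_t max_lag) {
  if (master_latest_seq <= replica_next_seq) return false;  // replica is caught up
  return (master_latest_seq - replica_next_seq) > max_lag;  // too far behind: disconnect it
}
```

   Disconnecting while the replica's sequence is still in the WAL is what would make expectation 4 (a successful psync on reconnect) possible.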
   
   ### What did you see instead?
   
   The replication feed thread blocked for 44 hours with no logs or errors:
   
      I20260127 22:16:21.006304  2857 replication.cc:115] WAL was rotated, would reopen again
                                 [... 44 hours of silence ...]
      I20260129 18:36:55.603111  2857 replication.cc:115] WAL was rotated, would reopen again
      E20260129 18:36:55.646749  2857 replication.cc:126] Fatal error encountered, WAL iterator is discrete, some seq might be lost, sequence 480156205527 expected, but got 481055967952
      W20260129 18:36:55.646785  2857 replication.cc:84] Slave thread was terminated
   
   The replica then failed to psync ("sequence out of range") and required a full resync.
   
   Root cause: in `FeedSlaveThread::loop()`, the call to `util::SockSend()` (line 225) blocks indefinitely when the TCP send buffer is full. The underlying `WriteImpl()` has no timeout mechanism. During this blocked period the master continues writing and WAL files are pruned, leaving the replica's sequence no longer available.
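
   For reference, this is the shape of the failure: a timeout-less blocking send loop only returns once the kernel send buffer drains. This is a simplified illustration of that pattern, not the actual `util::SockSend()`/`WriteImpl()` code:

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

// Simplified sketch of a blocking send loop with no deadline. If the peer stops
// reading, send() parks the calling thread until buffer space frees up, which may
// never happen while the connection stays technically alive.
ssize_t BlockingSendAll(int fd, const char *buf, size_t len) {
  size_t sent = 0;
  while (sent < len) {
    ssize_t n = send(fd, buf + sent, len - sent, 0);  // may block indefinitely
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal: retry
      return -1;                     // hard socket error (e.g. connection reset)
    }
    sent += static_cast<size_t>(n);
  }
  return static_cast<ssize_t>(sent);
}
```

   Nothing in such a loop ever observes "the replica has not drained anything for hours", which is why the feed thread sat silent for 44 hours.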
   
   ### Anything Else?
   
   I've drafted a possible solution using Claude Code (I'm not a C++ expert or developer) that could address this issue with three components:
   
   1. **Socket send timeout**: a new `SockSendWithTimeout()` function using poll() with a configurable timeout (default 30s); a rough sketch follows this list

   2. **Replication lag detection**: check the lag at the start of each loop iteration and disconnect if it exceeds a configurable threshold (default 100M sequences)

   3. **Exponential backoff on reconnection**: prevents rapid reconnect loops for persistently slow replicas (1s, 2s, 4s... up to 60s)
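
   A minimal sketch of the poll()-based send; the function name, parameters, and the MSG_DONTWAIT/MSG_NOSIGNAL usage are all up for discussion, and none of this exists in kvrocks yet:

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <chrono>
#include <cstddef>

// Returns true if all bytes were written before the deadline, false on timeout or error.
// On false, the caller would log a warning and drop the replica connection.
bool SockSendWithTimeout(int fd, const char *buf, size_t len, int timeout_ms) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + std::chrono::milliseconds(timeout_ms);

  size_t sent = 0;
  while (sent < len) {
    const auto remaining =
        std::chrono::duration_cast<std::chrono::milliseconds>(deadline - Clock::now()).count();
    if (remaining <= 0) return false;  // overall deadline exceeded

    pollfd pfd{fd, POLLOUT, 0};
    const int ready = poll(&pfd, 1, static_cast<int>(remaining));
    if (ready < 0) {
      if (errno == EINTR) continue;    // interrupted: recompute remaining time and retry
      return false;                    // poll() failure
    }
    if (ready == 0) return false;      // socket never became writable in time
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) return false;

    // MSG_DONTWAIT keeps the send itself from blocking; MSG_NOSIGNAL avoids SIGPIPE
    // if the replica already closed the connection (both Linux-specific flags).
    const ssize_t n = send(fd, buf + sent, len - sent, MSG_DONTWAIT | MSG_NOSIGNAL);
    if (n < 0) {
      if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) continue;
      return false;                    // hard socket error
    }
    sent += static_cast<size_t>(n);
  }
  return true;
}
```

   The reconnection backoff in component 3 can stay trivial on top of this: double a sleep from 1s up to a 60s cap after each failed attempt, and reset it once a sync succeeds.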
   
     New configuration options:
   - `max-replication-lag`: Max sequence lag before disconnecting a slow consumer
     - `replication-send-timeout-ms`: Socket send timeout in milliseconds
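
   For illustration only, in kvrocks.conf form with the suggested defaults (neither option exists in any current kvrocks release; names and defaults are open to discussion):

```
# Proposed options, not present in current kvrocks:
max-replication-lag 100000000       # max sequence lag before disconnecting a slow replica
replication-send-timeout-ms 30000   # socket send timeout for the replication feed, in ms
```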
   
   I'm happy to submit a PR with this potential fix. The changes touch:
   - src/config/config.h, config.cc (new config options)
   - src/common/io_util.h, io_util.cc (SockSendWithTimeout)
   - src/cluster/replication.h, replication.cc (lag detection, timeout usage, backoff)
   
   Workaround for affected users: Increase `rocksdb.max_total_wal_size` significantly (e.g., 16GB) to extend WAL retention and reduce the likelihood of exhaustion.
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

