sryanyuan commented on PR #3357:
URL: https://github.com/apache/kvrocks/pull/3357#issuecomment-3840759574

   I’ve gone through the changes and here are some observations and suggestions:
   
   * Master not logging send errors
   
       Currently, the master does not log any send errors until WAL sequence 
continuation fails.
   
       This seems to be caused by slow network transmission: each send 
operation takes a long time to complete, and the delay accumulates over time. 
Eventually, the WAL entries the slave still needs are cleaned up before it can 
catch up.
   
   * Replication lag detection config
   
       The newly added configuration for replication lag detection to 
proactively disconnect a slave might help in some cases, but it may not fully 
solve the slow transmission problem.
   
   * Potential issue on the slave side
   
       One major risk I see is that on the slave side, a half-open connection 
can remain for a long time before triggering a timeout and reconnect. The slave 
receives nothing during that window, which eventually leads to continuation 
failure.
   
       Adding a read timeout on the slave side could help mitigate this 
scenario.
   
   * Master-side send timeout & efficiency improvements
   
       Adding a send timeout on the master side could also help disconnect 
half-open slave connections earlier.
   
       However, to truly improve transmission efficiency and avoid this 
situation, techniques such as compressing WAL entries before sending might be 
worth considering.
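
To make the two timeout suggestions concrete, here is a minimal Python sketch of the idea. It is not Kvrocks code: `recv_with_timeout` and `send_with_timeout` are hypothetical helper names, and the timeout values are arbitrary. It only shows that a bounded read on the slave side and a bounded send on the master side turn a silent or stalled peer into an explicit, detectable failure.

```python
import socket

def recv_with_timeout(sock, nbytes, timeout_s):
    """Slave side: return received bytes, or None if nothing arrives in time."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Peer stayed silent: treat the link as dead and reconnect.
        return None

def send_with_timeout(sock, payload, timeout_s):
    """Master side: return True if payload was fully sent, False if it stalled."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # Peer is not draining its receive buffer: drop this connection.
        return False

if __name__ == "__main__":
    # Local socket pair standing in for a master/slave replication link.
    master_end, slave_end = socket.socketpair()
    print("silent peer ->", recv_with_timeout(slave_end, 4096, 0.2))
    master_end.sendall(b"batch")
    print("live peer   ->", recv_with_timeout(slave_end, 4096, 0.2))
    master_end.close()
    slave_end.close()
```

In a real server the same effect would more likely come from the event loop's own timeout facilities or from `SO_RCVTIMEO` / `SO_SNDTIMEO`, but the failure handling is the same: on timeout, close and let the slave re-establish replication.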
   
   Overall, the changes go in the right direction for detecting and handling 
lag earlier, but I think addressing connection timeout handling (both master 
and slave) and optimizing WAL transmission could make the solution more robust.
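
As a rough illustration of the compression idea, the sketch below compresses a batch of bytes with zlib and adds a small length-prefixed frame so the receiver can validate it. The frame layout and the names `pack_batch` / `unpack_batch` are assumptions made up for this example, not the Kvrocks replication format, and the sample payload is synthetic.

```python
import struct
import zlib

HEADER = struct.Struct("!II")  # compressed length, original length (hypothetical layout)

def pack_batch(raw: bytes, level: int = 6) -> bytes:
    """Compress one batch and prefix it with both lengths."""
    body = zlib.compress(raw, level)
    return HEADER.pack(len(body), len(raw)) + body

def unpack_batch(frame: bytes) -> bytes:
    """Reverse pack_batch and check the decompressed size against the header."""
    body_len, raw_len = HEADER.unpack(frame[:HEADER.size])
    raw = zlib.decompress(frame[HEADER.size:HEADER.size + body_len])
    if len(raw) != raw_len:
        raise ValueError("decompressed length does not match header")
    return raw

if __name__ == "__main__":
    # Synthetic, highly repetitive payload; real WAL data will compress differently.
    batch = b"SET key:0001 value\n" * 1000
    frame = pack_batch(batch)
    print(len(batch), "bytes raw,", len(frame), "bytes on the wire")
```

Compression trades master CPU for bandwidth, so it would only pay off when the link, not the CPU, is the bottleneck.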


