sryanyuan commented on PR #3357:
URL: https://github.com/apache/kvrocks/pull/3357#issuecomment-3840759574
I’ve gone through the changes and here are some observations and suggestions:
* Master not logging send errors
Currently, the master does not log any send errors until WAL sequence
continuation fails.
This seems to be caused by slow network transmission: each send
operation takes a long time to complete, and the delay accumulates over time.
Eventually, WAL entries are cleaned up before the slave can catch up.
* Replication lag detection config
The newly added replication lag detection config, which proactively
disconnects a lagging slave, might help in some cases, but it may not fully
solve the slow transmission problem.
* Potential issue on the slave side
One major risk I see is that on the slave side, a half-open connection
can remain for a long time before triggering a timeout and reconnect, which
eventually leads to continuation failure.
Adding a read timeout on the slave side could help mitigate this
scenario.
* Master-side send timeout & efficiency improvements
Adding a send timeout on the master side could also help disconnect
half-open slave connections earlier.
However, to truly improve transmission efficiency and avoid this
situation, techniques such as compressing WAL data before sending might be
worth considering.
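
To illustrate the compression idea, here is a minimal sketch in Python. It is
not kvrocks code: the function names and the 8-byte frame header (raw length,
compressed length) are hypothetical and are not the actual replication wire
format. It only shows that a batch can be compressed before sending and
restored on the receiving side.

```python
import struct
import zlib

# Hypothetical frame layout (NOT the kvrocks protocol):
#   4 bytes raw length | 4 bytes compressed length | compressed body
HEADER = struct.Struct(">II")


def pack_batch(raw: bytes, level: int = 1) -> bytes:
    """Compress one batch and prefix it with its raw and compressed sizes."""
    body = zlib.compress(raw, level)  # low level favors speed over ratio
    return HEADER.pack(len(raw), len(body)) + body


def unpack_batch(frame: bytes) -> bytes:
    """Reverse pack_batch and verify the declared raw length."""
    raw_len, body_len = HEADER.unpack(frame[:HEADER.size])
    body = frame[HEADER.size:HEADER.size + body_len]
    raw = zlib.decompress(body)
    if len(raw) != raw_len:
        raise ValueError("decompressed length does not match header")
    return raw


if __name__ == "__main__":
    batch = b"SET key:0001 value\n" * 1000  # stand-in for a WAL batch
    frame = pack_batch(batch)
    print(len(batch), "bytes raw ->", len(frame), "bytes on the wire")
```

How much this helps depends on how compressible the real WAL data is and on
the CPU cost on the master, so it would need to be measured.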
Overall, the changes go in the right direction for detecting and handling
lag earlier, but I think addressing connection timeout handling (on both master
and slave) and optimizing WAL transmission could make the solution more robust.
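
To make the timeout suggestion concrete, here is a minimal sketch of read and
send timeouts on a plain socket, again in Python and not taken from kvrocks.
The helper names are hypothetical. In C or C++ the comparable socket options
would be SO_RCVTIMEO and SO_SNDTIMEO; how that maps onto the existing
replication code is something I have not checked.

```python
import socket


def recv_or_none(sock: socket.socket, nbytes: int, timeout_s: float):
    """Slave side: return data, b"" on orderly close, or None on timeout."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Peer has been silent for timeout_s: the caller can treat the
        # connection as possibly half-open and reconnect.
        return None


def send_or_false(sock: socket.socket, payload: bytes, timeout_s: float) -> bool:
    """Master side: return False if the payload cannot be sent in time."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # The peer is not draining its receive buffer: the caller can
        # log the event and drop the connection.
        return False


if __name__ == "__main__":
    master, slave = socket.socketpair()
    print(recv_or_none(slave, 16, 0.1))   # nothing was sent yet
    master.sendall(b"ping")
    print(recv_or_none(slave, 16, 0.1))
    master.close()
    slave.close()
```

With both timeouts in place, each side gets a bounded wait instead of
depending on the other side to close the connection.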