Greetings! I've been experiencing and troubleshooting this problem for several months now with little success.
I'm using DRBD 8.4.11-1 in a 2-node, dual-primary cluster on CentOS 7.5.1804. The cluster is an HA virtualization solution based on KVM.

At random, roughly once a month, the DRBD service on node2 fails to finish a write request from node1 (sock_sendmsg time expired). Node1 then initiates fencing, which results in an IPMI reboot of node2. From what I can tell, there is increased disk activity on node1 that node2 can't keep up with. The hardware of the two nodes is identical, and DRBD replication runs over a dedicated, redundant 10G connection.

I'll start by including some basic, sanitized configs and log messages. I can provide fairly detailed performance metrics from sysstat if necessary.

DRBD configuration: https://pastebin.com/aNB7uB4r
Node1 logs: https://pastebin.com/aEhSWy1b
Node2 logs: https://pastebin.com/sFU84BWZ
Hardware configuration: https://pastebin.com/jzRwxQeP
RAID configuration and info: https://pastebin.com/vnGsUkHW

Any help in troubleshooting this mystery is greatly appreciated. Please let me know if you need any other information.

Thanks.
-Chris H
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user
