Hi all,

at work my team and I are running a DRBD 8.4 two-node cluster whose primary node locks up at seemingly random times, cutting off access to its data.

When this happens, dmesg shows entries such as this one from DRBD:

We did not send a P_BARRIER for 5118944ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
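If I'm doing the math right, the detection threshold there is only 42 seconds, while the actual stall lasted about 85 minutes:

    threshold = ko-count * timeout = 7 * (60 * 0.1 s) = 42 s
    observed  = 5118944 ms ≈ 5119 s ≈ 85 minutes

So by the time the message appears, the DRBD worker thread has apparently been stuck for quite a while already.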

At the same time, DRBD commands such as 'drbdadm secondary' hang indefinitely, and reading the state from '/proc/drbd' blocks as well.
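Next time it happens we intend to dump the stacks of all blocked tasks to see where the DRBD threads are stuck; roughly this, assuming SysRq is enabled in the guest:

    # enable all SysRq functions if not already enabled
    echo 1 > /proc/sys/kernel/sysrq
    # dump stack traces of all tasks in uninterruptible (D) sleep to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg

If anyone has a better way to capture the state of blocked kernel threads, we're all ears.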

This is happening on a Debian 10 Xen virtual machine (via XCP-ng). The installed 'drbd-utils' Debian package is version 9.5.0-1, and the 'drbd.ko' module is version 8.4.10. The kernel is 4.19.208-1, installed via the 'linux-image-4.19.0-18-amd64' package. The config, as shown by 'drbdadm dump', is available on Pastebin: https://pastebin.com/raw/b122wQU9.

The DRBD device serves as the backing block device for a ZFS zpool; ZFS itself is version '2.0.3-9~bpo10+1'.

Systems monitoring suggests that the issue occurs when disk load, measured as I/O wait time, is higher than usual. We have seen the lockup only twice so far, though, so that is not much of a pattern yet. And although disk load appears to be a factor, none of the other virtual machine tenants on the same hypervisor and disk array are having problems. The underlying storage is an SSD-based RAID 10 array of 4 disks in total, which shows no suspicious behavior or metrics. Does anyone have any pointers as to what might be going on here?
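In the meantime we plan to keep per-device I/O statistics logging running so we have latency data from the next occurrence; something along these lines (the log path is just an example):

    # extended per-device stats with timestamps, sampled every 5 seconds
    iostat -xdt 5 >> /var/log/iostat-drbd.log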

Google suggests that a lack of RAM might be an issue; however, in both instances when this happened the node in question had about 15 GiB of free RAM out of a total of 48 GiB.

Just for fun, we're thinking about testing a Debian 11 backports kernel, but we don't have any concrete direction to go in.

Any and all hints are greatly appreciated, thanks!