Hi all,

at work my team and I are running a DRBD 8.4 two-node cluster whose primary node locks up at seemingly random times, cutting off access to its data.

When this happens, dmesg shows entries such as this one from DRBD:

We did not send a P_BARRIER for 5118944ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
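If I'm doing the math right, the detection threshold there is only 42 seconds, while the actual stall lasted about 85 minutes:

    threshold = ko-count * timeout = 7 * (60 * 0.1 s) = 42 s
    observed  = 5118944 ms ≈ 5119 s ≈ 85 minutes

So by the time the message appears, the DRBD worker thread has apparently been stuck for quite a while already.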

At the same time, DRBD commands such as 'drbdadm secondary' hang indefinitely, and reading the state from '/proc/drbd' blocks as well.
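Next time it happens we intend to dump the stacks of all blocked tasks to see where the DRBD threads are stuck; roughly this, assuming SysRq is enabled in the guest:

    # enable all SysRq functions if not already enabled
    echo 1 > /proc/sys/kernel/sysrq
    # dump stack traces of all tasks in uninterruptible (D) sleep to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg

If anyone has a better way to capture the state of blocked kernel threads, we're all ears.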

This is happening on a Debian 10 Xen virtual machine (via XCP-ng). The installed 'drbd-utils' Debian package is version 9.5.0-1, and the 'drbd.ko' module is version 8.4.10. The kernel is 4.19.208-1, installed via the 'linux-image-4.19.0-18-amd64' package. The config, as shown by 'drbdadm dump', is available on Pastebin: https://pastebin.com/raw/b122wQU9.

The DRBD device serves as the backing block device for a ZFS zpool; ZFS itself is version '2.0.3-9~bpo10+1'.

Systems monitoring suggests that the issue occurs when disk load, measured as I/O wait time, is higher than usual. We have seen the lockup only twice so far, though, so that is not much of a pattern yet. And although disk load appears to be a factor, none of the other virtual machine tenants on the same hypervisor and disk array are having problems. The underlying storage is an SSD-based RAID 10 array of 4 disks in total, which shows no suspicious behavior or metrics. Does anyone have any pointers as to what might be going on here?
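In the meantime we plan to keep per-device I/O statistics logging running so we have latency data from the next occurrence; something along these lines (the log path is just an example):

    # extended per-device stats with timestamps, sampled every 5 seconds
    iostat -xdt 5 >> /var/log/iostat-drbd.log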

Google suggests that a lack of RAM might be an issue; however, in both instances when this happened the node in question had about 15 GiB of free RAM out of a total of 48 GiB.

Just for fun, we're thinking about testing a Debian 11 backports kernel, but we don't have any concrete direction to go in.

Any and all hints are greatly appreciated, thanks!