Hello Roland,

First, thanks for looking into my question.

On 28/06/2018 at 08:41, Roland Kammerer wrote:
> On Wed, Jun 27, 2018 at 12:37:20PM +0200, Julien Escario wrote:
>> Hello,
>> We're experiencing a really strange situation.
>> We often play with :
>> drbdmanage peer-device-options --resource <ressource> --c-max-rate <rate>
>>
>> especially when a node crash and need a (full) resync.
>>
>> When doing this, sometimes (after 10 or 20 such commands), we end up with
>> drbdmanage completely stuck and a drbdsetup that seems to block on an IO with
>> returning.
>> For example :
>> drbdsetup disk-options 144 --set-defaults --read-balancing=prefer-local
>> --al-extents=6481 --al-updates=no --md-flushes=no
>>
>> drbdadm status display ressource up to this one then hangs on drbdsetup call.
>>
>> drbdtop is still usable.
>>
>> Right now, we didn't manage to find a solution without rebooting the node
>> (sadly).
>>
>> Do you experience such situation ?
>> What can cause this ?
>
> What version of DRBD9 is that (cat /proc/drbd)? drbdsetup hangs for a
> reason, kernel related, not an actual bug in drbdsetup. "dmesg" at that
> time would be interesting. Yes, I saw that, but not recently, only with
> by now pretty old versions of DRBD9.

I posted a full dump of the kernel messages here:
https://framabin.org/p/?988ac2e36beabde6#a6UmS1uK/idqlPCCgoIar8oeEcNjRf8kCmdlECPu+V4=

Versions:
cat /proc/drbd
version: 9.0.12-1 (api:2/proto:86-112)
Transports (api:16): tcp (9.0.12-1)

drbd-utils 9.3.0-1
drbdmanage-proxmox 2.1-1

Is that too old?

Could my problem be caused by running a two-node setup? Are three nodes the
required minimum for correct operation (even if I'm aware it's the
recommended setup)?

>> Is there a way to unblock this process without rebooting ?
>
> Depends, but as a rule of thumb: when that happens the kernel is already
> in a state where you want to/have to reboot.

We can also see memory errors that were automatically corrected by ECC:

Jun 28 04:02:37 vm8 kernel: [159134.557190] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Jun 28 04:02:37 vm8 kernel: [159134.557191] {24}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 28 04:02:37 vm8 kernel: [159134.557192] {24}[Hardware Error]: event severity: corrected
Jun 28 04:02:37 vm8 kernel: [159134.557193] {24}[Hardware Error]:  Error 0, type: corrected
Jun 28 04:02:37 vm8 kernel: [159134.557195] {24}[Hardware Error]:  fru_text: CorrectedErr
Jun 28 04:02:37 vm8 kernel: [159134.557197] {24}[Hardware Error]:   section_type: memory error
Jun 28 04:02:37 vm8 kernel: [159134.557198] {24}[Hardware Error]:   node: 0 device: 1
Jun 28 04:02:37 vm8 kernel: [159134.557200] {24}[Hardware Error]:   error_type: 2, single-bit ECC

Perhaps this is related, I don't know. The drbdsetup process also hung on the
'sane' node.

But a night of memtest (3 complete passes) didn't detect any error.

Best regards,
Julien Escario
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user
