Greetings everyone,
We are running a small 3 node drbd 9.0.21 cluster, used for opennebula vm
storage. Lately we have started recieving messages from one of our servers,
about getting a "Wrong magic value 0x52464200 in protocol version 86". We are
using lacp bonded 2 gig links for traffic exchange between drbd hosts. Both
lacp links are working fine and no outages were detected previously. It's also
interesting to point out that drbd9 seems to "ignore" the problematic node and
doesn't place any vm's on it.
Here is a list of kernel versions on our servers:
100 - 4.15.0-72-generic #81-Ubuntu SMP #controller node
101 - 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 #storage node
102 - 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 #storage node
We tried changing network rendering engines, checking /tmp dir size, making
non-lacp network between servers, and searching for network inconsistencies and
even used different networking equipment but unfortunately, we haven't
achieved any results.
The problem is highly reproducible in our environment ( even after fully
reinstalling drbd, we still get magic value error ether on 101 or 102, while
resources are avoiding deployment on 102).
We get following logs in kernel ring buffer on 102:
[Wed Feb 5 21:02:29 2020] drbd one-image-10: susp-io( no -> user)
[Wed Feb 5 21:02:32 2020] drbd one-image-10: susp-io( user -> no)
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0: Starting worker thread
(from drbdsetup [20138])
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Starting
sender thread (from drbdsetup [20143])
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Starting
sender thread (from drbdsetup [20146])
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: conn(
StandAlone -> Unconnected )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Starting
receiver thread (from drbd_w_one-vm-1 [20139])
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
StandAlone -> Unconnected )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Starting
receiver thread (from drbd_w_one-vm-1 [20139])
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Failed to
initiate connection, err=-98
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: conn(
Connecting -> Disconnecting )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Aborting
remote state change 0 commit not possible
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Restarting
sender thread
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Connection
closed
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: conn(
Disconnecting -> StandAlone )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Terminating
receiver thread
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Failed to
initiate connection, err=-98
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Connecting -> Disconnecting )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Aborting
remote state change 0 commit not possible
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Restarting
sender thread
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Connection
closed
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Disconnecting -> StandAlone )
[Wed Feb 5 21:02:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Terminating
receiver thread
[Wed Feb 5 21:07:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n100: Terminating
sender thread
[Wed Feb 5 21:07:35 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Terminating
sender thread
[Wed Feb 5 21:07:35 2020] drbd one-vm-1385-disk-0/0 drbd1099: drbd_bm_resize
called with capacity == 0
[Wed Feb 5 21:07:35 2020] drbd one-vm-1385-disk-0: Terminating worker thread
Seems quite a similar case described here:
https://lists.linbit.com/pipermail/drbd-user/2009-October/012777.html
At the same time on 100 (controller node):
[Wed Feb 5 21:02:29 2020] drbd one-image-10: susp-io( no -> user)
[Wed Feb 5 21:02:33 2020] drbd one-image-10: susp-io( user -> no)
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0: Starting worker thread
(from drbdsetup [3965])
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Starting
sender thread (from drbdsetup [3971])
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Starting
sender thread (from drbdsetup [3975])
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: meta-data IO
uses: blk-bio
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: disk( Diskless
-> Attaching )
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: Maximum number
of peer devices = 7
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0: Method to ensure write
ordering: flush
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: drbd_bm_resize
called with capacity == 4201520
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: resync bitmap:
bits=525190 words=57449 pages=113
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: size = 2052 MB
(2100760 KB)
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: size = 2052 MB
(2100760 KB)
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: recounting of
set bits took additional 0ms
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: disk( Attaching
-> UpToDate )
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0/0 drbd1099: attached to
current UUID: C518769AA829C854
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
StandAlone -> Unconnected )
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Starting
receiver thread (from drbd_w_one-vm-1 [3966])
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
StandAlone -> Unconnected )
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Starting
receiver thread (from drbd_w_one-vm-1 [3966])
[Wed Feb 5 21:02:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Handshake to
peer 0 successful: Agreed network protocol version 116
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Feature
flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Peer
authenticated using 20 bytes HMAC
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Starting
ack_recv thread (from drbd_r_one-vm-1 [4004])
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Preparing
remote state change 1265772256
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Committing
remote state change 1265772256 (primary_nodes=0)
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Connecting -> Connected ) peer( Unknown -> Secondary )
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
drbd_sync_handshake:
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
self C518769AA829C854:0000000000000000:0000000000000000:0000000000000000 bits:0
flags:20
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
peer C518769AA829C854:0000000000000000:0000000000000000:0000000000000000 bits:0
flags:20
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
uuid_compare()=no-sync by rule 40
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099: quorum( no ->
yes )
[Wed Feb 5 21:02:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
[Wed Feb 5 21:02:48 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Wrong magic
value 0x52464220 in protocol version 86
[Wed Feb 5 21:02:48 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Connecting -> NetworkFailure )
[Wed Feb 5 21:02:48 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Restarting
sender thread
[Wed Feb 5 21:02:48 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Connection
closed
[Wed Feb 5 21:02:48 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
NetworkFailure -> Unconnected )
[Wed Feb 5 21:02:49 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:03:01 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Wrong magic
value 0x52464220 in protocol version 86
[Wed Feb 5 21:03:01 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Connecting -> NetworkFailure )
[Wed Feb 5 21:03:01 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Restarting
sender thread
[Wed Feb 5 21:03:01 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Connection
closed
[Wed Feb 5 21:03:01 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
NetworkFailure -> Unconnected )
[Wed Feb 5 21:03:02 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Unconnected -> Connecting )
[Wed Feb 5 21:03:11 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Wrong magic
value 0x52464220 in protocol version 86
<last 6 entries repeat for a while>
[Wed Feb 5 21:07:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Restarting
sender thread
[Wed Feb 5 21:07:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Connection
closed
[Wed Feb 5 21:07:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: conn(
Disconnecting -> StandAlone )
[Wed Feb 5 21:07:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Terminating
receiver thread
[Wed Feb 5 21:07:36 2020] drbd one-vm-1385-disk-0 nts-cloud-n102: Terminating
sender thread
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0: Preparing cluster-wide
state change 1186333411 (1->0 496/16)
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0: State change 1186333411:
primary_nodes=0, weak_nodes=0
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Cluster is
now split
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0: Committing cluster-wide
state change 1186333411 (0ms)
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Connected -> Disconnecting ) peer( Secondary -> Unknown )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0/0 drbd1099: quorum( yes ->
no )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0/0 drbd1099 nts-cloud-n101:
pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: ack_receiver
terminated
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Terminating
ack_recv thread
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Aborting
remote state change 0 commit not possible
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Restarting
sender thread
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Connection
closed
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: conn(
Disconnecting -> StandAlone )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Terminating
receiver thread
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0 nts-cloud-n101: Terminating
sender thread
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0/0 drbd1099: disk( UpToDate
-> Detaching )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0/0 drbd1099: disk( Detaching
-> Diskless )
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0/0 drbd1099: drbd_bm_resize
called with capacity == 0
[Wed Feb 5 21:07:37 2020] drbd one-vm-1385-disk-0: Terminating worker thread
drbdadm status on 100:
one-vm-1385-disk-0 role:Secondary
disk:UpToDate
nts-cloud-n101 role:Secondary
peer-disk:UpToDate
nts-cloud-n102 connection:Connecting
Best of all,
Gleb.
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user