Hi all,
I have a 3-node DRBD cluster that has suffered a split-brain. I recovered all
resources except one.
For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but the
connection Node1-Node2 is not working: each side reports the other as
StandAlone.
***Node 3
[root@pbzne4demo-n3 ~]# drbdadm status influxdb
influxdb role:Primary
  disk:UpToDate
  pbzne4demo-n1.wp.lan role:Secondary
    peer-disk:UpToDate
  pbzne4demo-n2.wp.lan role:Secondary
    peer-disk:UpToDate
***Node 2
[root@pbzne4demo-n2 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
***Node 1
[root@pbzne4demo-n1 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
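To compare the three status outputs quickly, the StandAlone peers can be picked out with a small script. This is a minimal Python sketch, assuming the text of `drbdadm status influxdb` has been captured as a string; the function name `standalone_peers` is hypothetical and the parsing only relies on the `connection:StandAlone` token seen above.

```python
def standalone_peers(status_text):
    """Return peer names whose connection is reported as StandAlone."""
    peers = []
    for line in status_text.splitlines():
        fields = line.split()
        # A broken connection line looks like "<peer> connection:StandAlone";
        # healthy connections show "<peer> role:..." instead.
        if len(fields) >= 2 and "connection:StandAlone" in fields[1:]:
            peers.append(fields[0])
    return peers


# Status text as reported on Node 2 above.
node2_status = """\
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
"""

print(standalone_peers(node2_status))
```

Run against the Node 2 output, this lists only pbzne4demo-n1.wp.lan; the Node 3 output yields an empty list.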
I tried disconnecting and reconnecting the resource on every node, but the
StandAlone state always remains on the same two nodes.
What I tried:
1. Disconnect on all nodes, connect on the primary node, then connect
--discard-my-data on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Starting ack_recv thread (from drbd_r_influxdb [7948])
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Connection closed
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Terminating receiver thread
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Preparing remote state change 271906619
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Committing remote state change 271906619 (primary_nodes=8)
***Node 1
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn(
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn(
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Starting ack_recv thread (from drbd_r_influxdb [30208])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Handshake to peer 3 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Starting ack_recv thread (from drbd_r_influxdb [30210])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Connection closed
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Terminating receiver thread
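One possible reading of "incompatible discard-my-data settings" is that both ends of the Node1-Node2 connection asked to discard their own data, since attempt 1 ran connect --discard-my-data on both secondary nodes. That reading is an assumption drawn from the message text, not confirmed behaviour. The Python sketch below (the helper name `conflicting_connections` is hypothetical) only enumerates node pairs where both ends set the flag, to check the idea against the observed failure.

```python
from itertools import combinations


def conflicting_connections(discard_flags):
    """Return node pairs where both ends asked to discard their own data."""
    return [
        (a, b)
        for a, b in combinations(sorted(discard_flags), 2)
        if discard_flags[a] and discard_flags[b]
    ]


# Attempt 1: plain connect on the primary (n3),
# connect --discard-my-data on both secondaries (n1, n2).
attempt_1 = {
    "pbzne4demo-n1": True,
    "pbzne4demo-n2": True,
    "pbzne4demo-n3": False,
}

print(conflicting_connections(attempt_1))
```

The only pair flagged is n1-n2, which matches the one connection that stays StandAlone. If the assumption holds, that link would need --discard-my-data on one side only.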
2. Ran drbdadm adjust on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
StandAlone -> Unconnected )
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Starting receiver thread (from drbd_w_influxdb [6563])
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Starting ack_recv thread (from drbd_r_influxdb [8026])
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Connection closed
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
Terminating receiver thread
***Node 1
Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
StandAlone -> Unconnected )
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Starting ack_recv thread (from drbd_r_influxdb [30273])
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Connection closed
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn(
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan:
Terminating receiver thread
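Both attempts end with the same sequence for the Node1-Node2 connection, which is easier to see once the wrapped log lines are re-joined and filtered by peer. This is a minimal Python sketch, assuming the log text is pasted as shown above (syslog-style timestamps, entries wrapped by the mail client); the names `unwrap` and `connection_events` are hypothetical.

```python
import re

# A new entry starts with a syslog timestamp such as "Jul 16 12:16:09 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")
# DRBD kernel entries: "kernel: drbd <resource> <peer>: <message>".
ENTRY = re.compile(r"kernel: drbd (\S+) (\S+): (.*)$")


def unwrap(log_text):
    """Re-join entries that were wrapped across several lines."""
    entries = []
    for line in log_text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def connection_events(log_text, peer):
    """Return (time, message) for DRBD kernel entries about one peer."""
    events = []
    for entry in unwrap(log_text):
        match = ENTRY.search(entry)
        if match and match.group(2) == peer:
            events.append((entry.split()[2], match.group(3)))
    return events


sample = """\
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan:
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn(
Connecting -> Disconnecting )
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan:
Preparing remote state change 271906619
"""

for when, message in connection_events(sample, "pbzne4demo-n1.wp.lan"):
    print(when, message)
```

Filtering on pbzne4demo-n1.wp.lan leaves the "incompatible discard-my-data settings" entry followed by the transition to Disconnecting, the same pair that appears in both attempts.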
3. Disconnect on all nodes, invalidate on both secondary nodes, connect on the
primary node, then connect on both secondary nodes.
StandAlone remains.
I think the next step might be working with the metadata, but since I am a
novice, I'm asking for suggestions. Can you please help me resolve this issue?
This is not a critical system and I can rebuild it, but I'd like to come up with
a procedure and a better understanding of how to handle this kind of case,
because I'm sure I will encounter it again.
Best regards,
Rocco Pezzani