Hi all,

I have a 3-node DRBD cluster that has suffered a split brain. I recovered all 
resources except one.
For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but 
the connection Node1-Node2 is not working: each side reports its connection to 
the other as StandAlone.

***Node 3
[root@pbzne4demo-n3 ~]# drbdadm status influxdb
influxdb role:Primary
  disk:UpToDate
  pbzne4demo-n1.wp.lan role:Secondary
    peer-disk:UpToDate
  pbzne4demo-n2.wp.lan role:Secondary
    peer-disk:UpToDate
***Node 2
[root@pbzne4demo-n2 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
***Node 1
[root@pbzne4demo-n1 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
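
The StandAlone connections can be picked out of this output with a small 
filter. This is a minimal sketch, assuming the `drbdadm status` layout shown 
above; the helper name `standalone_peers` is hypothetical and not part of 
drbd-utils.

```shell
# Reads `drbdadm status <resource>` output on stdin and prints the name of
# every peer whose connection is reported as StandAlone.
# (Hypothetical helper; it only filters text and changes nothing in DRBD.)
standalone_peers() {
  awk '/connection:StandAlone/ { print $1 }'
}
```

Run as `drbdadm status influxdb | standalone_peers`, it should print 
pbzne4demo-n1.wp.lan on Node 2 and pbzne4demo-n2.wp.lan on Node 1, given the 
status shown above.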

I tried disconnecting and reconnecting the resource on every node, but the 
StandAlone state always remains on the same two nodes.
What I tried:
1. Disconnect on all nodes, connect on the primary node, then connect with 
--discard-my-data on both secondary nodes.
Standalone remains.
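
A sketch of that sequence, reconstructed from the description above; the exact 
invocations are an assumption. The `drbd` dry-run wrapper, the `RES` variable 
and the `step1_*` function names are illustrative only. `disconnect`, `connect` 
and `--discard-my-data` are the drbdadm subcommands and option named in the 
step.

```shell
# Dry-run wrapper (illustrative): prints the command unless RUN=1 is set.
drbd() {
  if [ "${RUN:-0}" = "1" ]; then
    drbdadm "$@"
  else
    echo "drbdadm $*"
  fi
}

RES=influxdb

# On every node: drop the connections of the resource.
step1_disconnect() { drbd disconnect "$RES"; }

# On the primary (Node 3): reconnect normally.
step1_connect_primary() { drbd connect "$RES"; }

# On each secondary (Node 1 and Node 2): reconnect, discarding local data.
# Assumption: no peer is named, so the option is applied to the resource as a
# whole on that node rather than to a single connection.
step1_connect_secondary() { drbd connect --discard-my-data "$RES"; }
```

Without RUN=1 the functions only print the drbdadm command line, so the 
sequence can be reviewed before anything is executed.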
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [7948])
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Connection closed
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating receiver thread
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Preparing remote state change 271906619
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Committing remote state change 271906619 (primary_nodes=8)
***Node 1
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30208])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Handshake to peer 3 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30210])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Connection closed
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating receiver thread

2. Ran drbdadm adjust on both secondary nodes.
Standalone remains.
/var/log/messages reports this on secondary nodes:
***Node 2
Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6563])
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [8026])
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Connection closed
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating receiver thread
***Node 1
Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30273])
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Connection closed
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating receiver thread
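
Most of these log lines are thread start and stop noise. The following is a 
minimal sketch for reducing /var/log/messages to the connection state changes 
and the reported errors. It assumes each kernel message is a single line in the 
log file (the lines above are wrapped by the mail client); the function name 
`drbd_conn_events` is hypothetical.

```shell
# Reads syslog text on stdin and keeps, for the resource given as $1, only
# the lines that show a connection state change ("conn( A -> B )"), an
# "incompatible ..." message, or an "error ..." message.
drbd_conn_events() {
  grep -E "kernel: drbd $1 [^ ]+: (conn\(|incompatible|error )"
}
```

Example: `drbd_conn_events influxdb < /var/log/messages`. On the excerpts 
above this would keep the "incompatible discard-my-data settings" line 
together with the Connecting -> Disconnecting -> StandAlone transitions.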

3. Disconnect on all nodes, invalidate on both secondary nodes, connect on the 
primary node, then connect on both secondary nodes.
Standalone remains.
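
For reference, a sketch of the per-node command list this attempt corresponds 
to. The function `step3_commands` is hypothetical and only prints the commands; 
the exact invocations are an assumption based on the description above.

```shell
# Prints (does not run) the drbdadm commands for one node in attempt 3.
# $1 = role of the node: "primary" or "secondary"
# $2 = resource name (defaults to influxdb)
# Ordering across nodes, as described in the step: disconnect everywhere,
# invalidate on the secondaries, connect on the primary, then connect on the
# secondaries.
step3_commands() {
  role="$1"
  res="${2:-influxdb}"
  echo "drbdadm disconnect $res"
  if [ "$role" = "secondary" ]; then
    echo "drbdadm invalidate $res"
  fi
  echo "drbdadm connect $res"
}
```

For example, `step3_commands secondary` lists disconnect, invalidate and 
connect for influxdb, while `step3_commands primary` omits the invalidate.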

I think the next step might be working with the metadata, but since I am a 
novice, I'm asking for suggestions. Could you please help me resolve this 
issue?
This is not a critical system and I can rebuild it, but I'd like to come up 
with a procedure and a better understanding of how to handle this kind of 
case, because I'm sure I will encounter it again.


Best regards,
Rocco Pezzani
