I had already tried disconnecting and reconnecting the resources, and I also 
used the invalidate command.
Nothing changed.

Before touching the metadata, I tried restarting the DRBD services on all 
nodes, and that solved the problem. So I left the metadata alone.

journalctl showed no differences for drbd.service between the nodes.
The only “unusual” thing I noticed was that the restart on the first secondary 
node hung until the restart on the second secondary node was done. Here is what 
happened:

1. [Node3] systemctl restart drbd.service; restart OK
2. [Node2] systemctl restart drbd.service; restart hung, but the service seemed 
up and running
3. [Node1] systemctl restart drbd.service; restart OK. The restart on Node2 
completed at the same time.

I’ll examine the messages log on every node to understand what happened, but I 
don’t expect to find anything useful.


In the meantime, thank you all.

Best regards,
Rocco Pezzani


From: Gianni Milo <[email protected]>
Sent: Wednesday, 17 July 2019 09:21
To: Pezzani, Rocco <[email protected]>
Cc: [email protected]
Subject: Re: [DRBD-user] 3-Node DRBD with 2 standalone

I would try disconnecting or bringing down the resource on either Node1 or 
Node2, then writing some data on the Primary, and finally bringing up or 
connecting the resource again. This should trigger a sync of the newly written 
data to this resource/node.
The last option would be to either invalidate the data of the affected resource 
on Node1 or Node2, or re-create its metadata, but that will trigger a full 
sync, which may not be desirable.
Once you manage to sort this out, consider implementing the quorum feature in 
order to avoid split-brain situations in the future.

Gianni


On Wed, 17 Jul 2019 at 06:31, Pezzani, Rocco <[email protected]> wrote:
Hi all,

I have a 3-node DRBD cluster that has suffered a split-brain. I recovered all 
resources except one.
For this resource, the connections Node3-Node1 and Node3-Node2 are fine, but the 
connection Node1-Node2 is not working: each side sees the other as StandAlone.

***Node 3
[root@pbzne4demo-n3 ~]# drbdadm status influxdb
influxdb role:Primary
  disk:UpToDate
  pbzne4demo-n1.wp.lan role:Secondary
    peer-disk:UpToDate
  pbzne4demo-n2.wp.lan role:Secondary
    peer-disk:UpToDate
***Node 2
[root@pbzne4demo-n2 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n1.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
***Node 1
[root@pbzne4demo-n1 ~]# drbdadm status influxdb
influxdb role:Secondary
  disk:UpToDate
  pbzne4demo-n2.wp.lan connection:StandAlone
  pbzne4demo-n3.wp.lan role:Primary
    peer-disk:UpToDate
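
With more resources or more nodes, picking out the broken link by eye gets 
tedious. The following is a minimal sketch that filters the status output shown 
above for StandAlone connections. It assumes the output layout stays as 
printed here; the function name standalone_peers is made up for illustration.

```shell
# Print the peer name of every connection reported as StandAlone.
# Reads `drbdadm status` output on stdin.
# standalone_peers is a hypothetical helper name, not part of DRBD.
standalone_peers() {
    awk '/connection:StandAlone/ { print $1 }'
}

# Typical use on a node:
#   drbdadm status influxdb | standalone_peers
```

Run on Node 2 with the output above, this would print only the Node 1 peer 
name, since the Node 3 connection carries no StandAlone marker.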

I tried disconnecting and reconnecting the resource on every node, but the 
StandAlone state always remains on the same two nodes.
What I tried:
1. Disconnect from all nodes, connect on the primary node, connect 
--discard-my-data on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on the secondary nodes:
***Node 2
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [7948])
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Connection closed
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating receiver thread
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Preparing remote state change 271906619
Jul 16 12:16:10 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Committing remote state change 271906619 (primary_nodes=8)
***Node 1
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30208])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Handshake to peer 3 successful: Agreed network protocol version 114
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
ack_receiver terminated
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating ack_recv thread
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n3.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30210])
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Connection closed
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:16:09 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating receiver thread

2. Ran drbdadm adjust on both secondary nodes.
StandAlone remains.
/var/log/messages reports this on the secondary nodes:
***Node 2
Jul 16 12:20:01 pbzne4demo-n2 systemd: Started Session 3741 of user root.
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6563])
Jul 16 12:20:03 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Handshake to peer 1 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [8026])
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Connection closed
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n2 kernel: drbd influxdb pbzne4demo-n1.wp.lan: 
Terminating receiver thread
***Node 1
Jul 16 12:20:01 pbzne4demo-n1 systemd: Started Session 3754 of user root.
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
StandAlone -> Unconnected )
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting receiver thread (from drbd_w_influxdb [6596])
Jul 16 12:20:15 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Unconnected -> Connecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Handshake to peer 2 successful: Agreed network protocol version 114
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME 
WRITE_ZEROES.
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Starting ack_recv thread (from drbd_r_influxdb [30273])
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
incompatible discard-my-data settings
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Connecting -> Disconnecting )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: error 
receiving P_PROTOCOL, e: -5 l: 1!
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
ack_receiver terminated
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating ack_recv thread
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Connection closed
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: conn( 
Disconnecting -> StandAlone )
Jul 16 12:20:16 pbzne4demo-n1 kernel: drbd influxdb pbzne4demo-n2.wp.lan: 
Terminating receiver thread

3. Disconnect from all nodes, invalidate on both secondary nodes, connect on 
the primary node, then connect on both secondary nodes.
StandAlone remains.
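
The "incompatible discard-my-data settings" lines in the logs above are what 
DRBD reports when both ends of a connection ask to discard their own data. 
That is the situation on the Node1-Node2 link when both secondary nodes 
connect with --discard-my-data. Below is a hedged sketch of a one-sided 
variant. It rests on several assumptions: that one secondary (say Node 2) is 
chosen as the side whose data is discarded, that drbdadm accepts the 
resource:peer form to address a single connection, and that the function names 
and the DRBDADM dry-run variable are purely illustrative.

```shell
# Set DRBDADM="echo drbdadm" for a dry run (hypothetical convention).
DRBDADM=${DRBDADM:-drbdadm}

# Run on the node whose data should be discarded for this one connection.
# usage: reconnect_discarding RESOURCE PEER
reconnect_discarding() {
    $DRBDADM disconnect "$1:$2"
    $DRBDADM connect --discard-my-data "$1:$2"
}

# Run on the other node: a plain connect, without --discard-my-data.
# usage: reconnect_keeping RESOURCE PEER
reconnect_keeping() {
    $DRBDADM disconnect "$1:$2"
    $DRBDADM connect "$1:$2"
}
```

Under these assumptions, Node 2 would run 
reconnect_discarding influxdb pbzne4demo-n1.wp.lan and Node 1 would run 
reconnect_keeping influxdb pbzne4demo-n2.wp.lan, so that only one end of the 
link carries the discard flag.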

I think the next step might be working with the metadata, but since I am a 
novice, I’m asking for suggestions. Can you please help me resolve this issue?
This is not a critical system and I can rebuild it, but I’d like to come up 
with a procedure and a better understanding of how to handle this kind of 
case, because I’m sure I will encounter it again.


Best regards,
Rocco Pezzani
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user