On 11/18/2013 1:41 PM, Moullé Alain wrote:
> Hi,
> 
> with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we hit
> a strange and random problem:
> 
> For some reason that we can't identify in the syslog, one node (let's
> say node1) loses the 3 other members node2, node3, node4, without any
> visible network problems on either heartbeat network (configured in rrp
> active mode, each with a distinct mcast address and a distinct mcast port).

rrp active mode does not work properly and is untested. That's probably
why you are seeing odd things. Try passive mode instead.
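
For reference, a minimal sketch of a totem section with passive rrp
(the addresses, ports and networks below are placeholders; adjust them
to your two heartbeat networks):

    totem {
        version: 2
        rrp_mode: passive
        # ring 0 on the first heartbeat network
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
        # ring 1 on the second heartbeat network
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastaddr: 226.94.1.2
            mcastport: 5407
        }
    }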

If you are using VMs for testing, check that the host is not
overcommitted and that the VMs are not being paused.

Another issue is that recent kernels broke multicast on bridged
networks, so check that you are using udpu for transport.
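
If you do switch, here is a minimal udpu sketch, assuming your corosync
build supports the udpu transport (it was added after the 1.2.x
series); the member addresses are placeholders for your four nodes:

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            # with udpu there is no multicast, so every node
            # must be listed explicitly
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
            member {
                memberaddr: 192.168.1.3
            }
            member {
                memberaddr: 192.168.1.4
            }
        }
    }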

Fabio

> This node elects itself DC (while isolated, and even though node2 is
> already DC) until node2 (the DC) asks node3 to fence node1 (probably
> because it detects another DC).
> Main traces are given below.
> When node1 is rebooted and Pacemaker is started again, it rejoins the
> HA cluster and everything works fine.
> 
> I've checked the corosync changelog between 1.2.3-36 and 1.4.1-7, but
> around 188 bugzillas were fixed between the two releases, so I would
> like to know if someone on the development team remembers a fix for
> such a random problem, where a node isolated from the cluster for a
> few seconds elects itself DC and is consequently fenced by the former
> DC, which is in the quorate part of the HA cluster?
> 
> And also, as a workaround or as normal but missing tuning: does any
> tuning exist in the corosync parameters to prevent a node that is
> isolated for a few seconds from electing itself as the new DC?
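
On the tuning question: the totem timeouts control how long an
interruption must last before a node declares its peers lost. A minimal
sketch with example values only (not recommendations; test what your
networks actually need):

    totem {
        # ms without a token before token loss is declared
        # (the default is 1000)
        token: 10000
        # token retransmits attempted before a new configuration
        # is formed
        token_retransmits_before_loss_const: 10
        # ms to wait for consensus before starting a new membership
        # round; must be larger than token
        consensus: 12000
    }

A larger token timeout lets corosync ride out short interruptions
instead of dropping the other members, at the cost of slower detection
of real failures.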
> 
> Thanks a lot for your help.
> Alain Moullé
> 
> I can see traces like these in syslog:
> 
> node1 syslog:
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node2 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node3 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node4 is now lost (was member)
> ...
> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member: Our
> DC node (node2) left the cluster
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking
> over DC status for this partition
> ...
> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to
> node1 (3.0.5)
> 
> 
> node2 syslog:
> 03:29:05 node2 daemon info corosync   [pcmk  ] info: update_member: Node
> 704645642/node1 is now: lost
> ...
> 03:29:05 node2 daemon info corosync   [pcmk  ] info: update_member: Node
> 704645642/node1 is now: member
> 
> 
> node3:
> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify:
> Peer node1 was terminated (reboot) by node3 for node2
> (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK
> 

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais
