Moullé Alain wrote:
> Hi,
> 
> with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we got

1.2.3-36 is the problem. It was the last release WITHOUT official support for RRP.

> a strange and random problem :
> 
> For some reason that we can't identify in the syslog, one node (let's
> say node1) loses the 3 other members node2, node3 and node4, without
> any visible network problem on either heartbeat network (configured in
> RRP active mode, each with a distinct mcast address and a distinct
> mcast port).
> This node elects itself as DC (even though it is isolated and node2 is
> already DC) until node2 (the real DC) asks node3 to fence node1
> (probably because it detects another DC).
> Main traces are given below.
> When node1 is rebooted and Pacemaker is started again, it rejoins the
> HA cluster and everything works fine.
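
For reference, an RRP active setup like the one you describe usually looks
roughly like this in corosync.conf. This is only a minimal sketch; the
ringnumbers are the standard ones, but the addresses and ports below are
made-up examples, not values taken from your cluster:

totem {
        version: 2
        rrp_mode: active
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.1.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.2.0
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}

Worth double-checking that both rings really use distinct mcastaddr/mcastport
pairs, although with 1.2.3 even a correct RRP config is not officially
supported.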
> 
> I've checked the corosync changelog between 1.2.3-36 and 1.4.1-7, but
> around 188 bugzillas were fixed between the two releases, so I

Exactly. Actually, RRP in 1.4 is totally different than RRP in 1.2.3, so
there is no chance of a simple "fix". Just upgrade to the latest 1.4.
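
After the upgrade you can confirm which version is actually running
(assuming the corosync binary is in your PATH):

# corosync -v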

> would like to know whether someone on the development team remembers a
> fix for such a random problem, where a node isolated from the cluster
> for a few seconds elects itself DC and is consequently fenced by the
> former DC, which sits in the quorate part of the HA cluster.
> 
> And also, as a workaround (or as normal but missing tuning): does any
> tuning exist in the corosync parameters to prevent a node that is
> isolated for a few seconds from electing itself as the new DC?
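
The usual knobs for riding out a short outage are the totem timeouts in
corosync.conf. A sketch only; the values below are examples and must be
tuned for your network (the documented minimum for consensus is
1.2 * token):

totem {
        token: 10000
        token_retransmits_before_loss_const: 10
        consensus: 12000
}

A longer token only masks short interruptions, though; it does not fix the
RRP problems in 1.2.3, so the upgrade is still the real answer.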
> 
> Thanks a lot for your help.
> Alain Moullé
> 

Regards,
  Honza

> I can see traces like these in syslog:
> 
> node1 syslog:
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node2 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node3 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
> status: node4 is now lost (was member)
> ...
> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member: Our
> DC node (node2) left the cluster
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking
> over DC status for this partition
> ...
> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to
> node1 (3.0.5)
> 
> 
> node2 syslog:
> 03:29:05 node2 daemon info corosync   [pcmk  ] info: update_member: Node
> 704645642/node1 is now: lost
> ...
> 03:29:05 node2 daemon info corosync   [pcmk  ] info: update_member: Node
> 704645642/node1 is now: member
> 
> 
> node3 syslog:
> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify:
> Peer node1 was terminated (reboot) by node3 for node2
> (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK