Moullé Alain wrote:
> Hi,
>
> With corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we got
1.2.3-36 is the problem. This was the last release WITHOUT official support for RRP.

> a strange and random problem:
>
> For some reason that we can't identify in the syslog, one node (let's
> say node1) loses the 3 other members node2, node3 and node4, without any
> visible network problem on either heartbeat network (configured in RRP
> active mode, with a distinct mcast address and a distinct mcast port).
> This node elects itself DC (isolated, and while node2 is already the
> DC) until node2 (the DC) asks node3 to fence node1 (probably because it
> detects another DC).
> Main traces are given below.
> When node1 is rebooted and Pacemaker is started again, it is included
> in the HA cluster again and all works fine.
>
> I've checked the changelog of corosync between 1.2.3-36 and 1.4.1-7, but
> there are around 188 bugzillas fixed between the two releases, so... I

Exactly. Actually, RRP in 1.4 is totally different from RRP in 1.2.3. There is no chance of a simple "fix". Just upgrade to the latest 1.4.

> would like to know whether someone on the development team remembers a fix
> for such a random problem, where a node isolated from the cluster for a
> few seconds elects itself DC and is consequently fenced by the former DC,
> which is in the quorate part of the HA cluster?
>
> And also, as a workaround or as normal but missing tuning, whether some
> tuning exists in the corosync parameters to prevent a node isolated for a
> few seconds from electing itself as the new DC?
>
> Thanks a lot for your help.
> Alain Moullé

Regards,
  Honza

> I can see the following traces in the syslog:
>
> node1 syslog:
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node2 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node3 is now lost (was member)
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node4 is now lost (was member)
> ...
> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member: Our DC node (node2) left the cluster
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
> ...
> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking over DC status for this partition
> ...
> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to node1 (3.0.5)
>
> node2 syslog:
> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: lost
> ...
> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: member
>
> node3:
> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify: Peer node1 was terminated (reboot) by node3 for node2 (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK

_______________________________________________
Openais mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/openais
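
[Editor's note: for readers unfamiliar with the setup described above, a minimal corosync.conf totem section for RRP active mode with two rings might look like the sketch below. The bind, multicast address and port values are placeholders, not the poster's actual configuration. Note that corosync uses two UDP ports per ring (mcastport and mcastport - 1), so "distinct" ports should be at least 2 apart.]

    totem {
        version: 2
        rrp_mode: active                # replicate every message on both rings
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0    # placeholder: first heartbeat network
            mcastaddr: 239.255.1.1      # placeholder: distinct mcast address
            mcastport: 5405             # uses 5405 and 5404
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0    # placeholder: second heartbeat network
            mcastaddr: 239.255.2.1      # placeholder: distinct mcast address
            mcastport: 5407             # uses 5407 and 5406; 2 apart from ring 0
        }
    }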
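
[Editor's note: on the tuning question, the totem timing parameters control how long corosync waits before declaring token loss and forming a new membership, so raising them lets a node ride out a short network blip instead of splitting off, at the cost of slower detection of real failures. A hedged sketch with illustrative values only (the defaults are token: 1000 and consensus: 1.2 * token; consensus must stay larger than token):]

    totem {
        token: 10000                              # ms without the token before declaring loss (default 1000)
        token_retransmits_before_loss_const: 10   # retransmits before giving up (default 4)
        consensus: 12000                          # ms to reach consensus on a new membership; must exceed token
    }

[This only delays the membership change; it does not by itself stop a genuinely isolated node from electing a DC within its own partition. What a non-quorate partition is allowed to do is governed on the Pacemaker side, e.g. by the no-quorum-policy cluster property.]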
