Alain,

Moullé Alain wrote:
> Hi Jan,
>
> If you don't mind, I need more details about 1.2.3-36:
> I just checked the corosync.conf man page of this release again, and
> rrp_mode is described there with its 3 possible values: none, passive
> and active.
> So regarding "without official support for RRP":
> OK, "not officially supported", but does it really mean that it was
> not fully working correctly?

Yes, it was not working correctly. Actually, we put huge effort into
making it work (about half a year of work).
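Just so we are talking about the same thing: rrp_mode lives in the
totem section of corosync.conf, with one interface block per ring. A
minimal sketch of the two-ring active setup you describe (the
bindnetaddr, mcastaddr and mcastport values below are placeholders,
not taken from your cluster):

    totem {
        version: 2
        # none, passive or active; "active" sends traffic on both rings
        rrp_mode: active

        # ring 0 on the first heartbeat network
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }

        # ring 1 on the second heartbeat network, with a distinct
        # mcast address and port
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastaddr: 239.255.2.1
            mcastport: 5407
        }
    }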
> that there could randomly be a bad behavior such as the one I
> described below?

Yes.

> And also, regarding "was the last release without official support
> for RRP": does that mean that 1.4.0-1 gives a fully operational RRP
> mode?

RHEL 6.1 shipped 1.2.3-36. The next RHEL, RHEL 6.2, shipped 1.4.1-4.
Keep in mind that RRP in RHEL 6.2 is a Tech Preview, so again there is
no support.

> I'm a little bit afraid to upgrade to the latest 1.4 corosync release
> on RHEL 6.1 ... is there no compatibility risk with the other HA rpms
> (pacemaker-1.1.5-5, cluster-glue-1.0.5-2, etc.)?

There should be no problems, BUT you are on your own (which you are
anyway, because EUS for RHEL 6.1 was retired on May 31, 2013). I would
recommend (if possible) that you really consider updating to the
latest RHEL (probably wait for 6.5, where again a HUGE amount of fixes
is available). Another possibility may be to not use RRP at all and
consider bonding instead; rough sketches of both the timeout tuning
you asked about earlier and a bonding setup are appended below.

Regards,
Honza

> Thanks a lot for your help.
> Alain Moullé
>
> On 18/11/2013 15:50, Jan Friesse wrote:
>> Moullé Alain wrote:
>>> Hi,
>>>
>>> with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we
>>> got
>>
>> 1.2.3-36 is the problem. This was the last release WITHOUT official
>> support for RRP.
>>
>>> a strange and random problem:
>>>
>>> For some reason that we can't identify in the syslog, one node
>>> (let's say node1) loses the 3 other members node2, node3 and node4,
>>> without any visible network problem on either heartbeat network
>>> (configured in RRP active mode, each with a distinct mcast address
>>> and a distinct mcast port).
>>> This node elects itself DC (while isolated, and while node2 is
>>> already DC) until node2 (the DC) asks node3 to fence node1
>>> (probably because it detects another DC). The main traces are given
>>> below.
>>> When node1 is rebooted and Pacemaker is started again, it is
>>> included in the HA cluster again and everything works fine.
>>>
>>> I've checked the corosync changelog between 1.2.3-36 and 1.4.1-7,
>>> but there are around 188 bugzillas fixed between the two releases,
>>> so ... I
>>
>> Exactly. Actually, RRP in 1.4 is totally different than RRP in
>> 1.2.3. No chance of a simple "fix". Just upgrade to the latest 1.4.
>>
>>> would like to know if someone on the development team remembers a
>>> fix for such a random problem, where a node isolated from the
>>> cluster for a few seconds elects itself DC and is consequently
>>> fenced by the former DC, which is in the quorate part of the HA
>>> cluster?
>>>
>>> And also, as a workaround or as normal but missing tuning, whether
>>> some tuning exists in the corosync parameters to prevent a node
>>> that is isolated for a few seconds from electing itself as the new
>>> DC?
>>>
>>> Thanks a lot for your help.
>>> Alain Moullé
>>
>> Regards,
>> Honza
>>
>>> I can see such traces in the syslog:
>>>
>>> node1 syslog:
>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node2 is now lost (was member)
>>> ...
>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node3 is now lost (was member)
>>> ...
>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node4 is now lost (was member)
>>> ...
>>> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member: Our DC node (node2) left the cluster
>>> ...
>>> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
>>> ...
>>> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking over DC status for this partition
>>> ...
>>> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to node1 (3.0.5)
>>>
>>> node2 syslog:
>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: lost
>>> ...
>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: member
>>>
>>> node3:
>>> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify: Peer node1 was terminated (reboot) by node3 for node2 (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK
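On the tuning question in the quoted mail above (preventing a briefly
isolated node from forming its own membership): what people usually
tune are the totem timeouts, so that a short network blip does not
already count as token loss. A rough sketch only; the numbers below
are illustrative, not recommendations for your cluster, and you should
test them under your own load:

    totem {
        # time (in ms) to wait for a token before declaring token loss;
        # a larger value rides out short network blips
        token: 10000

        # number of token retransmits attempted before token loss is
        # declared
        token_retransmits_before_loss_const: 10

        # time (in ms) to wait for consensus before starting a new
        # membership round; keep it larger than token
        consensus: 12000
    }

Keep in mind that larger timeouts also mean slower detection of a node
that has really died, so failover gets correspondingly slower.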

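And on the bonding alternative: on RHEL 6 this is the usual ifcfg-based
setup, with corosync then configured with a single ring on top of the
bond. A rough sketch, assuming an active-backup bond named bond0 over
eth0 and eth1 (device names and addresses are placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=active-backup miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (same for eth1)
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes

That way the link redundancy is handled by the kernel bonding driver
instead of corosync's RRP code.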