Moullé Alain wrote:
> Hi again,
>
> and thanks for all the information.
>
> Last one: "Another possibility may be to not use RRP and consider
> bonding."
> So perhaps I did not completely understand: do you mean that even rrp
> mode set to "passive" will not work correctly?
Yes. With 1.2.3 it will not work.

> I understood that it was the "active" mode which was not fully
> operational, so I was about to set it to passive to avoid the problems ...

Active mode is unsupported even with 1.4.1.

Let me summarize it for you:

1.2.3 - Just don't use it. If you are using it, you may experience
various problems. If you really need to use it, please consider bonding.

1.4.1 - Only passive mode is fully supported (see the passive-mode
config sketch appended at the end of this mail).

Honza

> Alain
>
> On 19/11/2013 09:54, Jan Friesse wrote:
>> Alain,
>>
>> Moullé Alain wrote:
>>> Hi Jan,
>>>
>>> If you don't mind, I need more details about 1.2.3-36:
>>> I just checked the man page of corosync.conf in this release again, and
>>> rrp_mode is described with its 3 possible values: none, passive and
>>> active.
>>> So "without official support for rrp":
>>> OK, "not officially supported", but does it really mean that it was not
>>> fully working correctly?
>> Yes. It was not working correctly. Actually, we've put huge effort into
>> making it work (like 1/2 year of work).
>>
>>> And that there could randomly be a bad behavior such as the one I
>>> described below?
>> Yes
>>
>>> And also, "was last release without official support for RRP": does
>>> that mean that 1.4.0-1 gives a fully operational rrp mode?
>> RHEL 6.1 was 1.2.3-36. The next RHEL, RHEL 6.2, was 1.4.1-4. Keep in mind
>> that RRP in RHEL 6.2 is TechPreview, so again no support.
>>
>>> I'm a little bit afraid to upgrade to the latest 1.4 corosync release on
>>> a RHEL 6.1 ... is there no risk of incompatibility with the other HA
>>> rpms (pacemaker-1.1.5-5, cluster-glue-1.0.5-2, etc.)?
>>>
>> There should be no problems BUT you are on your own (which you are
>> anyway, because EUS for RHEL 6.1 was retired on May 31, 2013).
>>
>> I would recommend (if possible) that you really consider updating to the
>> latest RHEL (probably wait for 6.5, where again A HUGE amount of fixes
>> are available).
>>
>> Another possibility may be to not use RRP and consider bonding.
>>
>> Regards,
>> Honza
>>
>>> Thanks a lot for your help.
>>> Alain Moullé
>>>
>>> On 18/11/2013 15:50, Jan Friesse wrote:
>>>> Moullé Alain wrote:
>>>>> Hi,
>>>>>
>>>>> with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we got
>>>> 1.2.3-36 is the problem. This was the last release WITHOUT official
>>>> support for RRP.
>>>>
>>>>> a strange and random problem:
>>>>>
>>>>> For some reason that we can't identify in the syslog, one node (let's
>>>>> say node1) loses the 3 other members node2, node3, node4, without any
>>>>> visible network problem on either heartbeat network (configured in rrp
>>>>> active mode, with a distinct mcast address and a distinct mcast port).
>>>>> This node elects itself as DC (isolated, and while node2 is already
>>>>> DC) until node2 (the DC) asks node3 to fence node1 (probably because
>>>>> it detects another DC).
>>>>> Main traces are given below.
>>>>> When node1 is rebooted and Pacemaker is started again, it is included
>>>>> in the HA cluster again and all works fine.
>>>>>
>>>>> I've checked the changelog of corosync between 1.2.3-36 and 1.4.1-7,
>>>>> but there are around 188 bugzillas fixed between the two releases, so I
>>>> Exactly. Actually, RRP in 1.4 is totally different from RRP in 1.2.3.
>>>> No chance for a simple "fix". Just upgrade to the latest 1.4.
>>>>
>>>>> would like to know if someone in the development team remembers a fix
>>>>> for such a random problem, where a node isolated from the cluster for
>>>>> a few seconds elects itself DC and is consequently fenced by the
>>>>> former DC, which is in the quorate part of the HA cluster?
>>>>>
>>>>> And also, as a workaround or as normal but missing tuning, whether
>>>>> some tuning exists in the corosync parameters to prevent a node
>>>>> isolated for a few seconds from electing itself as the new DC?
>>>>>
>>>>> Thanks a lot for your help.
>>>>> Alain Moullé
>>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>> I can see such traces in syslog:
>>>>>
>>>>> node1 syslog:
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node2 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node3 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node4 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member:
>>>>> Our DC node (node2) left the cluster
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking
>>>>> over DC status for this partition
>>>>> ...
>>>>> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to
>>>>> node1 (3.0.5)
>>>>>
>>>>>
>>>>> node2 syslog:
>>>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node
>>>>> 704645642/node1 is now: lost
>>>>> ...
>>>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node
>>>>> 704645642/node1 is now: member
>>>>>
>>>>>
>>>>> node3:
>>>>> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify:
>>>>> Peer node1 was terminated (reboot) by node3 for node2
>>>>> (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK
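
For reference, here is a minimal sketch of what a passive-mode RRP setup in
corosync.conf could look like on 1.4.x. The subnets, multicast addresses and
ports below are placeholders, not values from the cluster discussed above;
the point is simply two interface sections on separate networks, each with
its own mcast address and port, and rrp_mode set to passive:

totem {
        version: 2
        rrp_mode: passive

        # First ring (placeholder subnet, mcast address and port)
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }

        # Second ring, on a separate subnet with its own mcast address/port
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.2.0
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}

If you drop RRP in favour of bonding instead, the usual approach on RHEL 6 is
an active-backup bond carrying a single corosync ring, so that link failover
is handled by the bonding driver rather than by corosync.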
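
Regarding the tuning question quoted above (keeping a node that is cut off
for only a few seconds from forming its own membership and electing itself
DC): the totem timeouts in corosync.conf control how quickly corosync
declares the other members lost. The values below are only illustrative, not
a recommendation for this particular cluster; larger values let corosync
ride out short network blips at the cost of slower detection of real
failures:

totem {
        # Timeout (ms) before token loss is declared when no token
        # arrives. The default is 1000; a larger value tolerates short
        # outages.
        token: 5000

        # Number of token retransmits attempted before the token is
        # declared lost (default 4).
        token_retransmits_before_loss_const: 10

        # Time (ms) to wait for consensus before starting a new round of
        # membership configuration; must be larger than token
        # (default 1.2 * token).
        consensus: 6000
}

Note that this only delays the membership change; DC election itself is
Pacemaker's behaviour, so a node that really stays isolated will still
promote itself and be fenced by the quorate partition.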