And just to complete the summary:
"1.4.1 - Only passive mode is fully supported."
OK, but in which release is the active mode fully operational (even if not
officially supported)?
Thanks
Alain
On 19/11/2013 10:26, Jan Friesse wrote:
Moullé Alain wrote:
Hi again,
and thanks for all the information.
One last question: "Another possibility may be to not use RRP and consider
bonding."
So perhaps I did not completely understand: do you mean that even with
rrp_mode set to "passive" it will not work correctly?
Yes. With 1.2.3 it will not work.
I understood that it was the "active" mode which was not fully
operational, so I was about to set it to passive to avoid the problems ...
Active mode is unsupported even with 1.4.1.
Let me summarize it for you:
1.2.3 - Just don't use it. If you are using it, you may experience
various problems. If you really need to use it, please consider bonding.
1.4.1 - Only passive mode is fully supported.
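For illustration, a passive-mode RRP setup in corosync.conf would look roughly
like the following (a sketch only; the bind networks, multicast addresses and
ports are made-up examples and must be adapted to your own heartbeat networks):

    totem {
        version: 2
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastaddr: 239.255.2.1
            mcastport: 5407
        }
    }

Each ring uses its own network, multicast address and port, matching the kind
of two-ring setup described below.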
Honza
Alain
On 19/11/2013 09:54, Jan Friesse wrote:
Alain,
Moullé Alain wrote:
Hi Jan,
If you don't mind, I need more details about 1.2.3-36:
I just checked the corosync.conf man page for this release again, and
rrp_mode is described with its three possible values: none, passive and
active.
So regarding "without official support for RRP":
OK, "not officially supported", but does it really mean that it was not
fully working correctly?
Yes. It was not working correctly. Actually, we put a huge effort into
making it work (about half a year of work).
And that there could randomly be bad behavior such as the one I described
below?
Yes
And also, "was last release without official support for RRP" , does
that mean that the 1.4.0-1 gives a fully operationnal rrp mode ?
RHEL 6.1 shipped 1.2.3-36. The next RHEL, 6.2, shipped 1.4.1-4. Keep in mind
that RRP in RHEL 6.2 is a Tech Preview, so again no support.
I'm a little bit afraid to upgrade to the latest 1.4 corosync release on
RHEL 6.1 ... isn't there a compatibility risk with the other HA RPMs
(pacemaker-1.1.5-5, cluster-glue-1.0.5-2, etc.)?
There should be no problems BUT you are on your own (which you are
anyway, because EUS for RHEL 6.1 was retired on
May 31, 2013).
I would recommend (if possible) that you really consider updating to the
latest RHEL (probably wait for 6.5, where again a HUGE number of fixes
are available).
Another possibility may be to not use RRP and consider bonding.
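As a rough example, an active-backup bond on RHEL 6 could be set up with
ifcfg files along these lines (illustrative only; the interface names and the
IP address are assumptions and must be adapted to your environment):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=active-backup miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly for eth1)
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes

Corosync is then configured with a single ring (no rrp_mode) bound to the
bond0 network.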
Regards,
Honza
Thanks a lot for your help.
Alain Moullé
On 18/11/2013 15:50, Jan Friesse wrote:
Moullé Alain wrote:
Hi,
with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we
got
1.2.3-36 is the problem. This was the last release WITHOUT official support
for RRP.
a strange and random problem:
For some reason that we can't identify in the syslog, one node (let's
say node1) loses the three other members node2, node3 and node4, without any
visible network problem on either heartbeat network (configured in RRP
active mode, each with a distinct mcast address and a distinct mcast
port).
This node elects itself DC (while isolated, and even though node2 is already
DC) until node2 (the DC) asks node3 to fence node1 (probably because it
detects another DC).
Main traces are given below.
When node1 is rebooted and Pacemaker is started again, it is included
in the HA cluster again and everything works fine.
I've checked the corosync changelog between 1.2.3-36 and
1.4.1-7, but there are around 188 bugzillas fixed between the two
releases, so ... so I
Exactly. Actually, RRP in 1.4 is totally different than RRP in
1.2.3. There is no chance for a simple "fix". Just upgrade to the latest 1.4.
would like to know if someone on the development team remembers a fix
for such a random problem, where a node isolated from the cluster
for a few seconds elects itself DC and is consequently fenced by the
former DC, which is in the quorate part of the HA cluster?
And also, as a workaround or as normal-but-missing tuning: does any tuning
of the corosync parameters exist to prevent a node that is isolated for a
few seconds from electing itself as the new DC?
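For example, would raising the totem timing values help? Something along
these lines (the values are purely illustrative guesses, not recommendations):

    totem {
        # Larger token timeout, so a short network blip is less likely to be
        # declared a membership change (at the cost of slower failure
        # detection); consensus must stay >= 1.2 * token.
        token: 5000
        consensus: 6000
        token_retransmits_before_loss_const: 10
    }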
Thanks a lot for your help.
Alain Moullé
Regards,
Honza
I can see traces like these in syslog:
node1 syslog:
03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node2 is now lost (was member)
...
03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node3 is now lost (was member)
...
03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback: status: node4 is now lost (was member)
...
03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member: Our DC node (node2) left the cluster
...
03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
...
03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking over DC status for this partition
...
03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to node1 (3.0.5)

node2 syslog:
03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: lost
...
03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node 704645642/node1 is now: member

node3:
03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify: Peer node1 was terminated (reboot) by node3 for node2 (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK
_______________________________________________
Openais mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/openais