Hi Tom (and Andrew),

I figured out an easy fix for the problem that I encountered. However, there seems to be a problem lurking in the code.
Here is what I found. On one of the servers that was online and hosting resources:

r2lead1:~ # netstat -a | grep crm
Proto RefCnt Flags       Type       State         I-Node  Path
unix  2      [ ACC ]     STREAM     LISTENING     18659   /var/run/crm/st_command
unix  2      [ ACC ]     STREAM     LISTENING     18826   /var/run/crm/cib_rw
unix  2      [ ACC ]     STREAM     LISTENING     19373   /var/run/crm/crmd
unix  2      [ ACC ]     STREAM     LISTENING     18675   /var/run/crm/attrd
unix  2      [ ACC ]     STREAM     LISTENING     18694   /var/run/crm/pengine
unix  2      [ ACC ]     STREAM     LISTENING     18824   /var/run/crm/cib_callback
unix  2      [ ACC ]     STREAM     LISTENING     18825   /var/run/crm/cib_ro
unix  2      [ ACC ]     STREAM     LISTENING     18662   /var/run/crm/st_callback
unix  3      [ ]         STREAM     CONNECTED     20659   /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     20656   /var/run/crm/cib_rw
unix  3      [ ]         STREAM     CONNECTED     19952   /var/run/crm/attrd
unix  3      [ ]         STREAM     CONNECTED     19944   /var/run/crm/st_callback
unix  3      [ ]         STREAM     CONNECTED     19941   /var/run/crm/st_command
unix  3      [ ]         STREAM     CONNECTED     19359   /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     19356   /var/run/crm/cib_rw
unix  3      [ ]         STREAM     CONNECTED     19353   /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     19350   /var/run/crm/cib_rw

On the node that was failing to join the HA cluster, this command returned nothing. On one of the functioning servers the stream information above was returned, but with an additional **941** instances of the following (with different I-Node numbers):

unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1236698 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1235930 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1235094 /var/run/crm/pengine

Here is how I corrected the situation: I ran "service openais stop" on the system that had the 941 pengine streams, then "service openais restart" on the server that was failing to join the HA cluster.

Results: the previously failing server joined the HA cluster and supports migration of resources to it. A subsequent "service openais start" on the server that had had the 941 pengine streams brought that node back online as well.
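In case it helps anyone else, here is a rough, untested sketch of how the check could be scripted across the cluster. The node names are placeholders for your own nodes, and it assumes passwordless ssh between them:

#!/bin/sh
# Untested sketch -- replace node1/node2/node3 with your own cluster node names.
# Counts established connections to the pengine IPC socket on each node;
# healthy nodes show only a handful, the bad node in my case showed 941.
for node in node1 node2 node3; do
    count=$(ssh "$node" "netstat -a | grep -c 'STREAM.*CONNECTED.*/var/run/crm/pengine'")
    echo "$node: $count pengine streams"
done

# Recovery sequence that worked for me (run by hand, in this order):
#   on the node with the leaked pengine streams:  service openais stop
#   on the node that failed to join the cluster:  service openais restart
#   back on the first node:                       service openais start

Obviously not something to automate blindly; it just saves poking at each node by hand.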
Regards,
Bob Haxo

On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
> So, Tom ... how do you get the failed node online?
>
> I've re-installed with the same image that is running on three other
> nodes, but it still fails. This node was quite happy for the past 3
> months. As I'm testing installs, this and other nodes have been
> installed a significant number of times without this sort of failure.
> I'd whack the whole HA cluster ... except that I don't want to run into
> this failure again without a better solution than "reinstall the
> system" ;-)
>
> I'm looking at the information returned with corosync debug enabled.
> After startup, everything looks fine to me until hitting this apparent
> local ipc delivery failure:
>
> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to
> pending delivery queue
> Jan 13 10:09:10 corosync [pcmk ] WARN: route_ais_message: Sending message to
> local.crmd failed: ipc delivery failed (rc=-2)
> Jan 13 10:09:10 corosync [pcmk ] Msg[6486] (dest=local:crmd,
> from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv
> origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
>
> Guess that I'll have to renew my acquaintance with ipc.
>
> Bob Haxo
>
> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
> > I don't know. I still have this issue (and it seems that I'm not the
> > only one ...). I'll have a look to see whether there are pacemaker
> > updates available through the zypper update channel (sles11-sp1).
> >
> > Regards,
> > Tom
> >
> > 2011/1/13 Bob Haxo <bh...@sgi.com>:
> > > Tom, others,
> > >
> > > Please, what was the solution to this issue?
> > >
> > > Thanks,
> > > Bob Haxo
> > >
> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
> > >
> > > Yes, corosync is running after the reboot. It comes up with the
> > > regular init procedure (runlevel 3 in my case).
> > >
> > > 2010/9/6 Andrew Beekhof <and...@beekhof.net>:
> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtu...@gmail.com> wrote:
> > >>> No, I don't have such failed messages. In my case, the "Connection to
> > >>> our AIS plugin" was established.
> > >>>
> > >>> The /dev/shm is also not full.
> > >>
> > >> Is corosync running?
> > >>
> > >>> Kind regards,
> > >>> Tom
> > >>>
> > >>> 2010/9/3 Michael Smith <msm...@cbnco.com>:
> > >>>> Tom Tux wrote:
> > >>>>
> > >>>>> If I disjoin one cluster node (node01) for maintenance purposes
> > >>>>> (/etc/init.d/openais stop) and reboot this node, it will not rejoin
> > >>>>> the cluster automatically. After the reboot, I have the following
> > >>>>> error and warning messages in the log:
> > >>>>>
> > >>>>> Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
> > >>>>
> > >>>> Do you have messages like this, too?
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC
> > >>>> credentials.
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to
> > >>>> the cluster... terminating
> > >>>>
> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
> > >>>>
> > >>>> Mike
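PS: for anyone who finds this thread later, the sanity checks suggested in the quoted messages above boil down to roughly the following. This is only a sketch; the log path is what my SLES systems use and may differ elsewhere:

df -h /dev/shm                                        # Michael's suggestion: make sure /dev/shm is not full
pgrep -l corosync                                     # Andrew's question: is corosync actually running?
grep -i 'invalid ipc credentials' /var/log/messages   # the symptom Michael reported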
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker