On Tue, Feb 2, 2010 at 10:47 PM, Steven Dake <[email protected]> wrote:
> On Tue, 2010-02-02 at 10:12 +0100, Dominik Klein wrote:
>> Hi
>>
>> the following situation happened on a 2-node corosync/pacemaker cluster
>> running 64-bit openSuSE 11.1. According to rpm -q, Pacemaker is version
>> 1.0.6-1 and Corosync is version 1.1.2-1.
>>
>> This morning (at about 8:37) one of my cluster nodes stopped responding.
>> No ping, no ssh, no service.
>>
>> Corosync, however, did not notice that the node was down until I told the
>> alive node to shut down a resource on that node. Immediately after doing
>> so, corosync logged:
>
> Other corosync daemons will always notice when a failed node running a
> corosync daemon stops participating in cluster communication. It is
> possible for corosync to still get scheduled while other processes are
> locked out from interactive processing, because corosync runs as a
> realtime process. If it consumes all CPU time in a spinlock, it will
> still schedule itself, but fail to schedule other non-RT processes in
> the system.
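For list readers who have not seen this failure mode before, here is a minimal, hypothetical C sketch of the effect Steve describes; it is not corosync code, just a toy realtime busy loop (the priority value is arbitrary):

/* Toy illustration only (not corosync code): a realtime-class task that
 * never blocks keeps the CPU, so it "still schedules" while ordinary
 * SCHED_OTHER tasks on that CPU are starved.  Needs root (or
 * CAP_SYS_NICE) to change the scheduling policy.
 */
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 50 }; /* arbitrary RT priority */

        if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {  /* 0 = this process */
                perror("sched_setscheduler");
                return 1;
        }

        for (;;)
                ;  /* spin: never yields, never sleeps, never blocks */
}

Run as root and pinned to a single core (or on a single-core box), this keeps getting CPU time while ordinary SCHED_OTHER tasks on that core starve, which is roughly the "corosync still runs but nothing else responds" scenario.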
My problem with that theory is that it's corosync that first notices the node is unavailable/failed. Only after corosync notices does Pacemaker react; it's not Pacemaker detecting that its peer process(es) are not responding. But it seems that corosync only notices once Pacemaker asks it to send a cluster message.

Anyway, let's see if Dominik can reproduce with 1.2.

> My first guess is that this is what happened to you as a result of the
> root cause of the patch fixed by revision 2558. I recommend trying
> corosync 1.2.0, or backporting revision 2558 from branches/wilson into
> your environment.
>
> If your problem persists, please file a defect.
>
> Also, please note that corosync 1.1.2 has a problem with timestamp: on
> and the use of pacemaker because of thread-safety issues. A workaround
> is to use timestamp: off, or to use corosync 1.2.0, or backport wilson
> revision 2626 into your environment.
>
> Regards
> -steve
>
>> Feb 2 08:58:06 inacd-db-srv04 corosync[9613]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 196: memb=1, new=0, lost=1
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: memb: inacd-db-srv04 47
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: lost: inacd-db-srv03 46
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 196: memb=1, new=0, lost=0
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: MEMB: inacd-db-srv04 47
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node inacd-db-srv03 was not seen in the previous transition
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: update_member: Node 46/inacd-db-srv03 is now: lost
>> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: send_member_notification: Sending membership update 196 to 2 children
>>
>> Notice there are 20 minutes between the node ceasing to respond and the
>> point where I managed to tell it to shut something down. So for 20
>> minutes, corosync failed to determine that there was a problem with the
>> node.
>>
>> So corosync reports to pacemaker that the node is lost, and pacemaker
>> acts promptly and resolves the situation (by rebooting the dead node and
>> starting things on the good node).
>>
>> What I can see in my SNMP management station is that, on the lost node,
>> the load was increasing from about 8:00 until it stopped responding. The
>> last record I have (8:35) says loadavg1 was around 20 (on a quad-core
>> system).
>>
>> I'll attach corosync.conf.
>>
>> If you need more information, please let me know.
>>
>> Is this something that would be fixed by installing 1.2.0?
>>
>> Regards
>> Dominik
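Re Steve's timestamp: off workaround above, for anyone who wants to apply it before upgrading: it lives in the logging stanza of corosync.conf. A sketch of that stanza (the values other than timestamp are just common defaults, not necessarily what Dominik's attached config uses):

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        syslog_facility: daemon
        timestamp: off          # workaround for the 1.1.2 thread-safety issue
        debug: off
}

With timestamp: off you don't really lose anything when logging to syslog, since syslog adds its own timestamps.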
