On Tue, 2010-02-02 at 10:12 +0100, Dominik Klein wrote:
> Hi
>
> the following situation happened on a 2 node corosync/pacemaker cluster
> running 64 bit openSuSE 11.1. According to rpm -q Pacemaker is version
> 1.0.6-1 and Corosync is version 1.1.2-1.
>
> This morning (at about 8:37) one of my cluster nodes stopped responding.
> No ping, no ssh, no service.
>
> Corosync however did not notice that the node was down until I told the
> alive node to shut down a resource on that node. Immediately after doing
> so, corosync logged:
Other corosync daemons will always notice when a failed node's corosync daemon stops participating in cluster communication. It is possible, however, for corosync itself to keep being scheduled while every other process on the node is locked out of interactive processing, because corosync runs as a realtime process. If it consumes all CPU time in a spinlock, it keeps running, but the non-realtime processes in the system no longer get scheduled (a small illustration of this effect is appended at the end of this message).

My first guess is that this is what happened to you, and that the root cause is the defect fixed by revision 2558. I recommend trying corosync 1.2.0, or backporting revision 2558 from branches/wilson into your environment. If your problem persists, please file a defect.

Also, please note that corosync 1.1.2 has a problem with "timestamp: on" when used with pacemaker, because of thread-safety issues. A workaround is to use "timestamp: off" (a sketch of the logging stanza is also appended at the end of this message), or to move to corosync 1.2.0, or to backport wilson revision 2626 into your environment.

Regards
-steve

>
> Feb 2 08:58:06 inacd-db-srv04 corosync[9613]: [TOTEM ] A processor failed, forming new configuration.
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 196: memb=1, new=0, lost=1
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: memb: inacd-db-srv04 47
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: lost: inacd-db-srv03 46
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 196: memb=1, new=0, lost=0
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: MEMB: inacd-db-srv04 47
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node inacd-db-srv03 was not seen in the previous transition
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: update_member: Node 46/inacd-db-srv03 is now: lost
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: send_member_notification: Sending membership update 196 to 2 children
>
> Notice there are 20 minutes between the node stopping to respond and
> the moment I managed to tell the node to shut down something. So for 20
> minutes, corosync failed to determine that there was a problem with the
> node.
>
> So corosync reports to pacemaker that the node is lost, and pacemaker
> acts promptly and resolves the situation (by rebooting the dead node and
> starting things on the good node).
>
> What I can see in my snmp management station is that on the lost node,
> the load was increasing from about 8:00 until it stopped responding. The
> last record I have (8:35) says loadavg1 around 20 (on a quad core system).
>
> I'll attach corosync.conf
>
> If you need more information, please let me know.
>
> Is this something that would be fixed by installing 1.2.0?
>
> Regards
> Dominik
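To make the realtime-scheduling point concrete, here is a minimal, self-contained sketch. It is not corosync source; the priority value and the deliberate busy loop are assumptions chosen purely for the demonstration. It shows how a process that raises itself into a realtime class and then spins keeps getting the CPU while ordinary SCHED_OTHER tasks on the same core are starved. Only run it on a throwaway machine.

    /* rt_spin.c -- illustration only, NOT corosync source.
     * Shows why a realtime process stuck in a busy loop can make a node
     * look dead from the outside while the process itself keeps running.
     * Build:  gcc -Wall -o rt_spin rt_spin.c
     * Run as root; it will monopolize one core until killed.
     */
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            struct sched_param param;

            memset(&param, 0, sizeof(param));
            param.sched_priority = 99;   /* a high RT priority, chosen for the demo */

            /* Put this process into the SCHED_RR realtime class, similar in
             * spirit to what a realtime daemon does at startup. */
            if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
                    perror("sched_setscheduler (requires root)");
                    return 1;
            }

            /* A bug that degenerates into a spinlock-style busy loop never
             * yields the CPU, so ordinary SCHED_OTHER processes competing
             * for this core (sshd, services, monitoring agents) stop making
             * progress, while this process is still scheduled and keeps
             * running. */
            for (;;)
                    ;

            return 0;
    }

That would match the report above: a starved node's corosync can still answer the membership protocol even though nothing else on the node responds, so the peers see no membership change until the cluster is asked to do real work.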

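For the timestamp workaround, a sketch of the relevant logging stanza in corosync.conf; everything except the timestamp line is a placeholder, so keep whatever values the attached corosync.conf already uses.

    # corosync.conf, logging stanza (illustrative values)
    logging {
            to_syslog: yes
            syslog_facility: daemon
            # "timestamp: on" triggers the thread-safety problem in 1.1.2
            # when pacemaker is loaded; leave it off until running 1.2.0 or
            # a build with wilson revision 2626 backported.
            timestamp: off
    }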