On Tue, 2010-02-02 at 10:12 +0100, Dominik Klein wrote:
> Hi
>
> the following situation happened on a 2 node corosync/pacemaker cluster
> running 64 bit openSuSE 11.1. According to rpm -q Pacemaker is version
> 1.0.6-1 and Corosync is version 1.1.2-1.
>
> This morning (at about 8:37) one of my cluster nodes stopped responding.
> No ping, no ssh, no service.
>
> Corosync however did not notice that the node was down until I told the
> alive node to shut down a resource on that node. Immediately after doing
> so, corosync logged:
Other corosync daemons will always notice when a failed node's corosync daemon stops participating in cluster communication. It is possible, however, for corosync itself to keep being scheduled while every other process on the node is locked out of interactive processing, because corosync runs as a realtime process. If it consumes all CPU time in a spinlock, it keeps running, but the non-realtime processes in the system no longer get scheduled (a small illustration of this effect is appended at the end of this message).

My first guess is that this is what happened to you, and that the root cause is the defect fixed by revision 2558. I recommend trying corosync 1.2.0, or backporting revision 2558 from branches/wilson into your environment. If your problem persists, please file a defect.

Also, please note that corosync 1.1.2 has a problem with "timestamp: on" when used with pacemaker, because of thread-safety issues. A workaround is to use "timestamp: off" (a sketch of the logging stanza is also appended at the end of this message), or to move to corosync 1.2.0, or to backport wilson revision 2626 into your environment.

Regards
-steve

>
> Feb 2 08:58:06 inacd-db-srv04 corosync[9613]: [TOTEM ] A processor failed, forming new configuration.
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 196: memb=1, new=0, lost=1
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: memb: inacd-db-srv04 47
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: lost: inacd-db-srv03 46
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 196: memb=1, new=0, lost=0
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: pcmk_peer_update: MEMB: inacd-db-srv04 47
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node inacd-db-srv03 was not seen in the previous transition
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: update_member: Node 46/inacd-db-srv03 is now: lost
> Feb 2 08:58:09 inacd-db-srv04 corosync[9613]: [pcmk ] info: send_member_notification: Sending membership update 196 to 2 children
>
> Notice there are 20 minutes between the node stopping to respond and
> the moment I managed to tell the node to shut down something. So for 20
> minutes, corosync failed to determine that there was a problem with the
> node.
>
> So corosync reports to pacemaker that the node is lost, and pacemaker
> acts promptly and resolves the situation (by rebooting the dead node and
> starting things on the good node).
>
> What I can see in my snmp management station is that on the lost node,
> the load was increasing from about 8:00 until it stopped responding. The
> last record I have (8:35) says loadavg1 around 20 (on a quad core system).
>
> I'll attach corosync.conf
>
> If you need more information, please let me know.
>
> Is this something that would be fixed by installing 1.2.0?
>
> Regards
> Dominik
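To make the realtime-scheduling point concrete, here is a minimal, self-contained sketch. It is not corosync source; the priority value and the deliberate busy loop are assumptions chosen purely for the demonstration. It shows how a process that raises itself into a realtime class and then spins keeps getting the CPU while ordinary SCHED_OTHER tasks on the same core are starved. Only run it on a throwaway machine.

    /* rt_spin.c -- illustration only, NOT corosync source.
     * Shows why a realtime process stuck in a busy loop can make a node
     * look dead from the outside while the process itself keeps running.
     * Build:  gcc -Wall -o rt_spin rt_spin.c
     * Run as root; it will monopolize one core until killed.
     */
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            struct sched_param param;

            memset(&param, 0, sizeof(param));
            param.sched_priority = 99;   /* a high RT priority, chosen for the demo */

            /* Put this process into the SCHED_RR realtime class, similar in
             * spirit to what a realtime daemon does at startup. */
            if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
                    perror("sched_setscheduler (requires root)");
                    return 1;
            }

            /* A bug that degenerates into a spinlock-style busy loop never
             * yields the CPU, so ordinary SCHED_OTHER processes competing
             * for this core (sshd, services, monitoring agents) stop making
             * progress, while this process is still scheduled and keeps
             * running. */
            for (;;)
                    ;

            return 0;
    }

That would match the report above: a starved node's corosync can still answer the membership protocol even though nothing else on the node responds, so the peers see no membership change until the cluster is asked to do real work.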

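For the timestamp workaround, a sketch of the relevant logging stanza in corosync.conf; everything except the timestamp line is a placeholder, so keep whatever values the attached corosync.conf already uses.

    # corosync.conf, logging stanza (illustrative values)
    logging {
            to_syslog: yes
            syslog_facility: daemon
            # "timestamp: on" triggers the thread-safety problem in 1.1.2
            # when pacemaker is loaded; leave it off until running 1.2.0 or
            # a build with wilson revision 2626 backported.
            timestamp: off
    }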