Re: [Openais] Node lost detection problem in corosync 1.1.2-1

Dominik Klein Tue, 02 Feb 2010 23:40:59 -0800

Steven Dake wrote:
> On Tue, 2010-02-02 at 10:12 +0100, Dominik Klein wrote:
>> Hi
>>
>> the following situation happened on a 2 node corosync/pacemaker cluster
>> running 64 bit openSuSE 11.1. According to rpm -q Pacemaker is version
>> 1.0.6-1 and Corosync is version 1.1.2-1.
>>
>> This morning (at about 8:37) one of my cluster nodes stopped responding.
>> No ping, no ssh, no service.
>>
>> Corosync, however, did not notice that the node was down until I told the
>> surviving node to shut down a resource on that node. Immediately after
>> doing so, corosync logged:
> 
> Other corosync daemons will always notice when a node running a
> corosync daemon fails to participate in cluster communication.  It
> is possible for corosync to still be scheduled while other processes
> are locked out from interactive processing, because corosync runs as a
> realtime process.  If it consumes all cpu time in a spinlock, it will
> still schedule itself, but fail to schedule other non-rt processes in
> the system.


Makes sense. And I could also reproduce this in my lab. Producing heavy
load and then looking at the network traffic showed corosync happily
working while everything else on the system was pretty much unavailable.
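
For anyone who wants to check whether a daemon really runs under a
realtime scheduling policy, here is a minimal sketch using the Linux-only
scheduling calls in Python's os module. The helper name
describe_scheduling and the policy table are made up for illustration;
nothing here is part of corosync or pacemaker.

```python
import os

# Linux scheduling policies exposed by Python's os module.
# getattr() is used so the table still builds if a constant is missing.
POLICY_NAMES = {}
for _name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR",
              "SCHED_BATCH", "SCHED_IDLE"):
    _value = getattr(os, _name, None)
    if _value is not None:
        POLICY_NAMES[_value] = _name

# SCHED_FIFO and SCHED_RR are the realtime policies: a runnable task under
# either of them is picked ahead of every normal (SCHED_OTHER) task.
REALTIME_NAMES = {"SCHED_FIFO", "SCHED_RR"}


def describe_scheduling(pid=0):
    """Return (policy_name, is_realtime) for pid; 0 means this process.

    Hypothetical helper for illustration only.
    """
    policy = os.sched_getscheduler(pid)
    # The kernel may OR the reset-on-fork flag into the returned value.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    name = POLICY_NAMES.get(policy, "unknown(%d)" % policy)
    return name, name in REALTIME_NAMES


if __name__ == "__main__":
    name, realtime = describe_scheduling(0)
    print("policy=%s realtime=%s" % (name, realtime))
```

Passing the pid of the corosync process instead of 0 should show whether
it is scheduled under a realtime policy on a given node, which would
match the behaviour described above: the daemon keeps answering on the
network while ordinary processes get no cpu time.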

> My first guess is that this is what happened to you, as a result of the
> problem fixed by the patch in revision 2558.  I recommend trying
> corosync 1.2.0, or backporting revision 2558 from branches/wilson into
> your environment.

Will do and report back. Thanks.

Regards
Dominik

> If your problem persists, please file a defect.
> 
> Also, please note that corosync 1.1.2 has a problem with timestamp: on
> and the use of pacemaker because of thread safety issues.  A workaround
> is to use timestamp: off, or to use corosync 1.2.0, or backport wilson
> revision 2626 into your environment.
> 
> Regards
> -steve
>> Feb  2 08:58:06 inacd-db-srv04 corosync[9613]:   [TOTEM ] A processor
>> failed, forming new configuration.
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] notice:
>> pcmk_peer_update: Transitional membership event on ring 196: memb=1,
>> new=0, lost=1
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: memb: inacd-db-srv04 47
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: lost: inacd-db-srv03 46
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] notice:
>> pcmk_peer_update: Stable membership event on ring 196: memb=1, new=0, lost=0
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: MEMB: inacd-db-srv04 47
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> ais_mark_unseen_peer_dead: Node inacd-db-srv03 was not seen in the
>> previous transition
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> update_member: Node 46/inacd-db-srv03 is now: lost
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> send_member_notification: Sending membership update 196 to 2 children
>>
>> Note that 20 minutes passed between the node ceasing to respond and
>> my telling the surviving node to shut something down. So for 20
>> minutes, corosync failed to detect that there was a problem with the
>> node.
>>
>> So corosync reports to pacemaker that the node is lost, and pacemaker
>> acts promptly and resolves the situation (by rebooting the dead node and
>> starting things on the good node).
>>
>> What I can see in my snmp management station is that on the lost node,
>> the load was increasing from about 8:00 until it stopped responding. The
>> last record I have (8:35) says loadavg1 around 20 (on a quad core system).
>>
>> I'll attach corosync.conf
>>
>> If you need more information, please let me know.
>>
>> Is this something that would be fixed by installing 1.2.0?
>>
>> Regards
>> Dominik
>> _______________________________________________
>> Openais mailing list
>> [email protected]
>> https://lists.linux-foundation.org/mailman/listinfo/openais
> 
> 


-- 
IN-telegence GmbH & Co. KG
Oskar-Jäger-Str. 125
50825 Köln

Registergericht Köln - HRA 14064, USt-ID Nr. DE 194 156 373
ph Gesellschafter: komware Unternehmensverwaltungsgesellschaft mbH,
Registergericht Köln - HRB 38396
Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen