On Tue, Feb 2, 2010 at 10:47 PM, Steven Dake <[email protected]> wrote:
> On Tue, 2010-02-02 at 10:12 +0100, Dominik Klein wrote:
>> Hi
>>
>> the following situation happened on a 2-node corosync/pacemaker cluster
>> running 64-bit openSuSE 11.1. According to rpm -q, Pacemaker is version
>> 1.0.6-1 and Corosync is version 1.1.2-1.
>>
>> This morning (at about 8:37) one of my cluster nodes stopped responding.
>> No ping, no ssh, no service.
>>
>> Corosync, however, did not notice that the node was down until I told
>> the alive node to shut down a resource on that node. Immediately after
>> doing so, corosync logged:
>
> The corosync daemons on the other nodes will always notice when a
> node's corosync daemon stops participating in cluster communication.
> It is possible for corosync to keep getting scheduled while other
> processes are locked out of interactive processing, because corosync
> runs as a realtime process.  If it consumes all CPU time in a
> spinlock, it will still be scheduled itself, but the non-realtime
> processes on the system will not be.
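
For anyone who hasn't seen this effect first-hand, here is a minimal
sketch of the starvation scenario Steve describes (not corosync code,
purely illustrative; run it as root, and note that on kernels with RT
throttling enabled ordinary tasks still get a small slice of CPU):

    /* Pin to one CPU, switch to a realtime policy, then spin.
       SCHED_OTHER tasks on that CPU make no progress, while this
       task keeps being scheduled. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(0, &cpus);                /* pin to CPU 0 */
        if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0)
            perror("sched_setaffinity");

        struct sched_param sp = { .sched_priority = 99 };
        if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
            perror("sched_setscheduler (needs root)");
            return 1;
        }

        for (;;)
            ;                             /* busy loop: non-RT tasks starve */
    }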

My problem with that theory is that it's corosync that first notices
the node is unavailable/failed.
Only after corosync notices does Pacemaker react - it's not Pacemaker
detecting that its peer process(es) are not responding.

But it seems that Corosync only notices once Pacemaker asks it to send
a cluster message.
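
For what it's worth, node failure is normally detected by the totem
token rotation itself, within the token timeout configured in
corosync.conf, independent of whether Pacemaker sends anything. As a
rough sketch (the values here are illustrative, not a recommendation):

    totem {
        version: 2
        # milliseconds without seeing the token before a processor is
        # declared failed and a new configuration is formed
        token: 1000
        # how long to wait for consensus before starting a new
        # membership round; usually kept a bit above the token timeout
        consensus: 1200
    }

That is what makes the 20-minute gap Dominik saw so odd.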

Anyway, let's see if Dominik can reproduce this with 1.2.

>
> My first guess is that this is what happened to you, as a result of
> the root cause fixed by the patch in revision 2558.  I recommend
> trying corosync 1.2.0, or backporting revision 2558 from
> branches/wilson into your environment.
>
> If your problem persists, please file a defect.
>
> Also, please note that corosync 1.1.2 has a problem when timestamp: on
> is used together with pacemaker, because of thread-safety issues.  The
> workaround is to use timestamp: off, to upgrade to corosync 1.2.0, or
> to backport wilson revision 2626 into your environment.
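
For reference, the timestamp workaround is just the logging section of
corosync.conf; something along these lines (exact option names as per
the corosync.conf(5) man page shipped with your version):

    logging {
        to_syslog: yes
        # "timestamp: on" is what triggers the thread-safety problem
        # in 1.1.2 when used together with pacemaker
        timestamp: off
    }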
>
> Regards
> -steve
>>
>> Feb  2 08:58:06 inacd-db-srv04 corosync[9613]:   [TOTEM ] A processor
>> failed, forming new configuration.
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] notice:
>> pcmk_peer_update: Transitional membership event on ring 196: memb=1,
>> new=0, lost=1
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: memb: inacd-db-srv04 47
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: lost: inacd-db-srv03 46
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] notice:
>> pcmk_peer_update: Stable membership event on ring 196: memb=1, new=0, lost=0
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> pcmk_peer_update: MEMB: inacd-db-srv04 47
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> ais_mark_unseen_peer_dead: Node inacd-db-srv03 was not seen in the
>> previous transition
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> update_member: Node 46/inacd-db-srv03 is now: lost
>> Feb  2 08:58:09 inacd-db-srv04 corosync[9613]:   [pcmk  ] info:
>> send_member_notification: Sending membership update 196 to 2 children
>>
>> Notice that 20 minutes passed between the node becoming unresponsive
>> and my managing to tell the cluster to shut something down on it. So
>> for 20 minutes, corosync failed to determine that there was a problem
>> with the node.
>>
>> So corosync reports to pacemaker that the node is lost, and pacemaker
>> acts promptly and resolves the situation (by rebooting the dead node and
>> starting things on the good node).
>>
>> What I can see in my SNMP management station is that on the lost node,
>> the load was increasing from about 8:00 until it stopped responding.
>> The last record I have (8:35) shows a 1-minute load average of around
>> 20 (on a quad-core system).
>>
>> I'll attach corosync.conf
>>
>> If you need more information, please let me know.
>>
>> Is this something that would be fixed by installing 1.2.0?
>>
>> Regards
>> Dominik
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
