Re: [Openais] Process pause detected for xxx ms, flushing membership messages?

Steven Dake Thu, 25 Feb 2010 12:14:25 -0800

On Thu, 2010-02-25 at 11:56 -0700, hj lee wrote:
> On Mon, Feb 22, 2010 at 11:30 AM, Steven Dake <[email protected]>
> wrote:
>         
>         On Sun, 2010-02-21 at 21:59 -0700, hj lee wrote:
>         > Hi,
>         >
>         > I am seeing this message from time to time in the log. Does
>         > this measure the pause time of corosync correctly? When
>         > corosync is scheduled back, how is the memb_join message
>         > processed before the pause_timer expires? The pause_timer can
>         > expire before the memb_join message is processed, and then it
>         > cannot measure how long corosync was descheduled.
>         >
>         
>         
>         HJ,
>         
>         I have not seen any "process pause detected" messages with
>         token=1000 at a 32-node count.  The pause_timer should expire
>         every token/5, which resets the pause_timestamp indicating
>         when corosync was last scheduled.
>         
>         The way coropoll works, though, is to schedule timers after
>         executing delivery of all the UDP messages.  If it takes
>         token/2 time to process all those UDP messages, it is possible
>         that the timer that resets the pause_timestamp is being caught
>         behind a bunch of messages processed by the poll loop.
>         
>         Could you try the attached patch?  It resets the pause
>         timestamp on receipt of the various message events that occur,
>         to prevent this theoretical condition.
>         
> 
> Hi,
> 
> Thanks for the patch. I haven't tried it yet. The problem I have is
> that the pause detection is logged in my two-node cluster. Sometimes
> the log reports a pause of more than 900 ms. That by itself is OK, but
> when one node prints this log, the other node gets a token loss
> timeout, so the cluster enters GATHER mode. When a node is paused for
> more than 900 ms, there is no mcast message; corosync is pretty much
> idle except for token passing. I really cannot understand why corosync
> is not running for more than 900 ms!
> 
> Corosync is running with the SCHED_RR real-time scheduling policy, so
> it should get CPU. There are also a pause timer (60 ms) and a
> retransmit timer (130 ms) enabled in operational mode. Corosync should
> run at least every 60 ms even if there is no mcast message! It should
> also wake up on every token or mcast message arrival. I think corosync
> is stuck somewhere during this 900 ms pause, either in the poll()
> routine or in a message processing callback. Do you have any idea
> what could cause this kind of pause?
> 
> Thanks very much
> hj
>


Pause detection is designed to detect when the corosync process is
not scheduled for long periods of time.  I have seen situations where
kernel drivers hold spinlocks for long periods without releasing them
(disabling scheduling in the process).

Without pause detection, the membership protocol gets into a mess:
when the process eventually unpauses, it still has membership messages
queued from a configuration change or two earlier that no longer apply
to it.
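
To make the mechanism discussed in this thread concrete, here is a minimal
toy model in Python. It is only a sketch, not corosync's code: the class and
method names are hypothetical, and the token/2 detection threshold is an
assumption taken from the token/2 figure mentioned earlier in the thread. It
shows a periodic timer (every token/5) refreshing a timestamp, and a
membership message being flushed when that timestamp is too old.

```python
class PauseDetector:
    """Toy model of the pause check described in this thread (hypothetical names)."""

    def __init__(self, token_ms):
        self.timer_interval_ms = token_ms / 5  # pause_timer period, per the thread
        self.threshold_ms = token_ms / 2       # ASSUMED detection threshold
        self.pause_timestamp_ms = 0

    def touch(self, now_ms):
        """Record that the process was scheduled at now_ms.

        Called by the periodic pause timer; with the patch mentioned in the
        thread it would also be called on receipt of other message events.
        """
        self.pause_timestamp_ms = now_ms

    def on_memb_message(self, now_ms):
        """Return the detected pause in ms (message flushed), or 0 (message kept)."""
        gap_ms = now_ms - self.pause_timestamp_ms
        if gap_ms > self.threshold_ms:
            self.touch(now_ms)
            return gap_ms
        return 0


detector = PauseDetector(token_ms=1000)
detector.touch(200)                      # pause timer fires at t=200 ms
print(detector.on_memb_message(300))     # 100 ms gap, below threshold -> 0
print(detector.on_memb_message(1100))    # timer never ran again -> 900
```

In this model a gap is reported whether the process was truly descheduled or
the timer callback was simply delayed behind message processing, which is the
theoretical false positive the patch is meant to rule out.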

Regards
-steve
> -- 
> Peakpoint Service
> 
> Cluster Setup, Troubleshooting & Development
> [email protected]
> (303) 997-2823

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
