Re: [Openais] Process pause detected for xxx ms, flushing membership messages?

hj lee Thu, 25 Feb 2010 13:40:14 -0800

On Thu, Feb 25, 2010 at 1:02 PM, Steven Dake <[email protected]> wrote:


> On Thu, 2010-02-25 at 11:56 -0700, hj lee wrote:
> > On Mon, Feb 22, 2010 at 11:30 AM, Steven Dake <[email protected]>
> > wrote:
> >
> >         On Sun, 2010-02-21 at 21:59 -0700, hj lee wrote:
> >         > Hi,
> >         >
> >         > I am seeing this message from time to time in the log. Does
> >         > it measure corosync's pause time correctly? When corosync is
> >         > scheduled back in, how is the memb_join message processed
> >         > before the pause_timer expires? The pause_timer can expire
> >         > before the memb_join message is processed, and then it
> >         > cannot measure how long corosync was descheduled.
> >         >
> >
> >
> >         HJ,
> >
> >         I have not seen any process pause detected messages with
> >         token=1000 at a 32-node count.  The pause_timer should expire
> >         every token/5, which resets the pause_timestamp indicating
> >         when corosync was last scheduled.
> >
> >         The way coropoll works, though, is to schedule timers after
> >         executing delivery of all the UDP messages.  If it takes
> >         token/2 time to process all those UDP messages, it is
> >         possible that the timer that resets the pause_timestamp is
> >         being caught behind a bunch of messages processed by the
> >         poll loop.
> >
> >         Could you try the attached patch?  It resets the pause
> >         timestamp on receipt of the various message events, to
> >         prevent this theoretical condition.
> >
> >
> > Hi,
> >
> > Thanks for the patch. I haven't tried it yet. The problem I have is
> > that the pause detection message is logged in my two-node cluster.
> > Sometimes the log reports a pause of more than 900 ms. That alone
> > would be OK, but when one node prints this message, the other node
> > hits the token loss timeout, so the cluster enters GATHER mode. When
> > a node is paused for more than 900 ms there are no mcast messages;
> > corosync is pretty much idle except for token passing. I really
> > cannot understand why corosync is not running for more than 900 ms!
> >
> > Corosync is running under the SCHED_RR real-time scheduling policy,
> > so it should get CPU time. Also, the pause timer (60 ms) and the
> > retransmit timer (130 ms) are enabled in operational mode. Corosync
> > should run at least every 60 ms even if there are no mcast messages!
> > It should also wake up on every token or mcast message arrival. I
> > think corosync is stuck somewhere during this 900 ms pause, either in
> > the poll() routine or in a message processing callback. Do you have
> > any idea about this kind of pause?
> >
> > Thanks very much
> > hj
> >
>
> The pause detection is designed to detect when the corosync process is
> not scheduled for long periods of time.  I have seen situations where
> kernel drivers take spinlocks for long periods and don't release them
> (disabling scheduling in the process).
>
>
The only driver related to corosync that I can think of is the network
driver. Does this spinlock holding happen in network driver code? If so,
more specifically, does it happen in the poll() call or in read()/write()?
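
To make the mechanism discussed in this thread concrete, here is a rough
Python model of the pause detection as I understand it from the
description above. It is only a sketch, not corosync's actual code: the
class and function names are hypothetical, the token/2 detection
threshold is an assumption for illustration, and time is simulated
rather than read from a clock. It shows how a timer that only runs after
all queued messages are delivered can report a pause even though the
process was running the whole time, and how resetting the timestamp on
message receipt (the idea of the patch) avoids that.

```python
# Simplified model of the pause detection described in this thread.
# All names are hypothetical; this is not corosync's implementation.

class PauseDetector:
    def __init__(self, token_ms):
        self.token_ms = token_ms
        # Per the thread, the pause timer fires every token/5.
        self.timer_interval_ms = token_ms / 5
        # Last simulated time at which we know the process was running.
        self.pause_timestamp_ms = 0.0

    def on_pause_timer(self, now_ms):
        """Record that the process was scheduled at now_ms."""
        self.pause_timestamp_ms = now_ms

    def pause_detected(self, now_ms, threshold_ms=None):
        """Return the gap in ms if it exceeds the threshold, else None."""
        if threshold_ms is None:
            threshold_ms = self.token_ms / 2  # assumed threshold
        gap = now_ms - self.pause_timestamp_ms
        return gap if gap > threshold_ms else None


def run_poll_iteration(detector, start_ms, message_costs_ms, reset_on_message):
    """Model one poll-loop pass: deliver every queued message, then run timers.

    message_costs_ms gives the simulated processing time of each message.
    Returns the list of pause gaps reported while handling messages.
    """
    now = start_ms
    reported = []
    for cost in message_costs_ms:
        if reset_on_message:
            # Idea of the patch: treat message receipt as proof of scheduling.
            detector.on_pause_timer(now)
        gap = detector.pause_detected(now)
        if gap is not None:
            reported.append(gap)
        now += cost
    # Timers only run once message delivery has finished.
    detector.on_pause_timer(now)
    return reported


if __name__ == "__main__":
    costs = [300, 300, 300]  # three messages, 300 ms each (made-up numbers)
    print(run_poll_iteration(PauseDetector(1000), 0, costs, False))
    print(run_poll_iteration(PauseDetector(1000), 0, costs, True))
```

In this model the process is never descheduled, yet the first run still
reports a pause because the timestamp goes stale behind the message
backlog. It does not explain a real 900 ms stall on an otherwise idle
node, which is the case I am asking about.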

Thanks
hj

> --
> > Peakpoint Service
> >
> > Cluster Setup, Troubleshooting & Development
> > [email protected]
> > (303) 997-2823
>
>


-- 
Peakpoint Service

Cluster Setup, Troubleshooting & Development
[email protected]
(303) 997-2823

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
