On Mon, Feb 22, 2010 at 11:30 AM, Steven Dake <[email protected]> wrote:

> On Sun, 2010-02-21 at 21:59 -0700, hj lee wrote:
> > Hi,
> >
> > I am seeing this message from time to time in the log. Does it measure
> > the pause time of corosync correctly? When corosync is scheduled
> > back, how is the memb_join message processed before pause_timer expires?
> > The pause_timer can expire before the memb_join message is handled, in
> > which case it cannot measure how long corosync was descheduled.
> >
>
> HJ,
>
> I have not seen any "process pause detected" messages with token=1000 at
> a 32 node count.  The pause_timer should expire every token/5, which
> resets the pause_timestamp indicating when corosync was last scheduled.
>
> The way coropoll works, though, is to schedule timers after executing
> delivery of all the UDP messages.  If it takes token/2 time to process
> all those UDP messages, it is possible that the timer that resets the
> pause_timestamp is being held up behind a bunch of messages
> processed by the poll loop.
>
> Could you try the attached patch?  It resets the pause timestamp on
> receipt of the various message events that occur to prevent this
> theoretical condition.
>
>
Hi,

Thanks for the patch. I haven't tried it yet. The problem I have is that the
pause-detected message is logged in my two-node cluster. Sometimes the log
reports a pause of more than 900 ms. That alone would be OK, but when one node
logs this message, the other node hits a token loss timeout, so the cluster
enters GATHER mode. When a node is paused for more than 900 ms there is no
mcast traffic; corosync is pretty much idle except for token passing. I really
cannot understand why corosync is not running for more than 900 ms!

corosync runs under the SCHED_RR real-time scheduling policy, so it should get
the CPU. Also, the pause timer (60 ms) and the retransmit timer (130 ms) are
enabled in operational mode. corosync should run at least every 60 ms even if
there is no mcast message! It should also wake up on every token or mcast
message arrival. I think corosync is stuck somewhere during this 900 ms
pause, either in the poll() routine or in a message-processing callback. Do you
have any idea what could cause this kind of pause?
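
To make the poll-loop theory from the quoted mail concrete, here is a rough C
sketch of how I read it. It is a toy model, not corosync's code: the struct,
the function names, and the token/2 reporting threshold are my assumptions,
and token = 300 ms in the note below is only inferred from the 60 ms pause
timer being token/5. The point is that if timers run only after all queued
messages are handled, a long batch of messages makes the pause timestamp look
stale even though the process never lost the CPU.

```c
#include <stdint.h>

/* Hypothetical model of the pause bookkeeping. All times are in ms. */
struct pause_model {
    uint64_t token_ms;           /* token timeout */
    uint64_t pause_timestamp_ms; /* last time the pause timer callback ran */
    uint64_t next_timer_ms;      /* when the pause timer is next due */
};

/* Pause timer callback: record "I was scheduled now", re-arm at token/5. */
void pause_timer_fire(struct pause_model *m, uint64_t now_ms)
{
    m->pause_timestamp_ms = now_ms;
    m->next_timer_ms = now_ms + m->token_ms / 5;
}

/*
 * Apparent pause seen while handling a message.
 * Returns 0 unless the gap exceeds token/2 (assumed threshold).
 */
uint64_t apparent_pause_ms(const struct pause_model *m, uint64_t now_ms)
{
    uint64_t gap = now_ms - m->pause_timestamp_ms;
    return (gap > m->token_ms / 2) ? gap : 0;
}

/*
 * One poll-loop pass: handle every queued message first, then run timers.
 * Advances *now_ms by the time spent and returns the largest apparent
 * pause observed while handling messages.
 */
uint64_t poll_pass(struct pause_model *m, uint64_t *now_ms,
                   unsigned msgs, uint64_t cost_per_msg_ms)
{
    uint64_t worst = 0;

    for (unsigned i = 0; i < msgs; i++) {
        *now_ms += cost_per_msg_ms;      /* time spent in the callback */
        uint64_t p = apparent_pause_ms(m, *now_ms);
        if (p > worst)
            worst = p;
    }
    if (*now_ms >= m->next_timer_ms)     /* timers run after the messages */
        pause_timer_fire(m, *now_ms);
    return worst;
}
```

With token_ms = 300 and the timer first fired at time 0, a pass that handles a
single 1 ms message reports no pause, while a following pass that handles 200
messages at 1 ms each reports an apparent pause of 201 ms. That would explain
a false report under load, but not my case, where the ring is nearly idle.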

Thanks very much
hj

-- 
Peakpoint Service

Cluster Setup, Troubleshooting & Development
[email protected]
(303) 997-2823
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
