On Thu, 2010-02-25 at 11:56 -0700, hj lee wrote:
> On Mon, Feb 22, 2010 at 11:30 AM, Steven Dake <[email protected]> wrote:
> > On Sun, 2010-02-21 at 21:59 -0700, hj lee wrote:
> > > Hi,
> > >
> > > I am seeing this message time to time in the log. Does this measure
> > > the pause time of corosyns correctly? When the corosync is scheduled
> > > back, how is memb_join message processed before pause_timer expires?
> > > The pause_timer can expire before memb_join message, then it can not
> > > measure the time of corosync descheduled.
> >
> > HJ,
> >
> > I have not seen any process pause detected messages with token=1000 at
> > 32 node count. the pause_timer should expire every token/5, which
> > resets the pause_timestamp indicating when corosync was last
> > scheduled.
> >
> > The way coropoll works though, is to schedule timers after executing
> > delivery of all the UDP messages. If it takes token/2 time to process
> > all those udp messages, it is possible the timer that resets the
> > pause_timestamp reset is being caught behind a bunch of messages
> > processed by the poll loop.
> >
> > Could you try the attached patch. It resets the pause timestamp on
> > receipt of the various message events that occur to prevent this
> > theoretical condition.
>
> Hi,
>
> Thanks for the patch. I haven't tried the patch yet. The problem I had
> is pause detect is logged in my two-node cluster. Sometimes the log
> says more than 900 ms paused. That's OK, but when one node prints this
> log, then the other node gets token lost timeout, so the cluster
> enters to GATHER mode. When a node is paused more than 900ms, there is
> no mcast message, the corosync is pretty much idle except token
> passing. I really can not understand why corosync is not running for
> more than 900ms!
>
> The corosync is running SCHED_RR real-time scheduling policy. I should
> get CPU. Also there are pause timer (60ms) and retransmit timer(130ms)
> enabled in the operational mode. The corosync should run at least
> every 60ms if there is no mcast message! Also it should wake up every
> token or mcast message arrival. I think corosync is stuck at somewhere
> during this 900ms pause time, either at poll() routine or message
> processing callback. Do you have any idea about this kind of pause?
>
> Thanks very much
> hj
>

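To make the timer-ordering condition from my earlier reply concrete, here is a rough Python simulation. It is a sketch, not corosync code. `Node`, `poll_once`, `reset_on_receive` and `pause` are names made up for the example. The token/2 reporting threshold, the 10 ms per-message cost, and the choice to reset the timestamp only on messages other than memb_join are all assumptions of this model.

```python
from dataclasses import dataclass

TOKEN_MS = 1000                      # token=1000, as in the thread
PAUSE_TIMER_MS = TOKEN_MS // 5       # pause timer period: token/5
PAUSE_THRESHOLD_MS = TOKEN_MS // 2   # assumption: report gaps above token/2


@dataclass
class Node:
    reset_on_receive: bool           # True models the patched behaviour
    now: int = 0                     # simulated clock, in ms
    pause_timestamp: int = 0         # last time we know we were scheduled
    next_timer: int = PAUSE_TIMER_MS

    def pause(self, ms):
        """Simulate the process being descheduled for ms milliseconds."""
        self.now += ms

    def receive(self, cost_ms, is_memb_join=False):
        """Handle one message; return the detected gap in ms, or None."""
        detected = None
        if is_memb_join:
            gap = self.now - self.pause_timestamp
            if gap > PAUSE_THRESHOLD_MS:
                detected = gap
        elif self.reset_on_receive:
            # Assumed patch behaviour: any other message proves we are running.
            self.pause_timestamp = self.now
        self.now += cost_ms
        return detected

    def run_timers(self):
        """Timers run only after the message queue has been drained."""
        if self.now >= self.next_timer:
            self.pause_timestamp = self.now
            self.next_timer = self.now + PAUSE_TIMER_MS


def poll_once(node, messages):
    """One poll iteration: deliver every message first, then run timers."""
    detections = []
    for cost_ms, is_memb_join in messages:
        gap = node.receive(cost_ms, is_memb_join)
        if gap is not None:
            detections.append(gap)
    node.run_timers()
    return detections


# 60 messages at 10 ms each (600 ms of work), followed by a memb_join.
burst = [(10, False)] * 60 + [(1, True)]

unpatched = poll_once(Node(reset_on_receive=False), burst)
patched = poll_once(Node(reset_on_receive=True), burst)

print(unpatched)  # [600]: busy time is misreported as a pause
print(patched)    # []: the timestamp was refreshed by each message
```

In this model the unpatched node never ran its timer during the burst, so the 600 ms of message processing looks like a pause. A genuine deschedule is still reported with the reset enabled: calling `pause(900)` on a fresh patched node and then delivering a memb_join returns 900.
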
The pause detection is designed to detect when the corosync process is
not scheduled for long periods of time. I have seen situations where
kernel drivers take spinlocks for long periods and don't release them
(disabling scheduling in the process). Without pause detection, the
membership protocol gets into a mess: when the process eventually
unpauses, it still has membership messages from a configuration change
or two earlier that no longer apply to it.

Regards
-steve

> --
> Peakpoint Service
>
> Cluster Setup, Troubleshooting & Development
> [email protected]
> (303) 997-2823

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
