On Thu, Feb 25, 2010 at 1:02 PM, Steven Dake <[email protected]> wrote:
> On Thu, 2010-02-25 at 11:56 -0700, hj lee wrote:
> > On Mon, Feb 22, 2010 at 11:30 AM, Steven Dake <[email protected]> wrote:
> > > On Sun, 2010-02-21 at 21:59 -0700, hj lee wrote:
> > > > Hi,
> > > >
> > > > I am seeing this message from time to time in the log. Does this
> > > > measure the pause time of corosync correctly? When corosync is
> > > > scheduled back in, how is the memb_join message processed before
> > > > the pause_timer expires? The pause_timer can expire before the
> > > > memb_join message is processed, in which case it cannot measure
> > > > how long corosync was descheduled.
> > >
> > > HJ,
> > >
> > > I have not seen any process pause detected messages with token=1000
> > > at 32 node count. The pause_timer should expire every token/5, which
> > > resets the pause_timestamp indicating when corosync was last
> > > scheduled.
> > >
> > > The way coropoll works, though, is to schedule timers after
> > > executing delivery of all the UDP messages. If it takes token/2 time
> > > to process all those UDP messages, it is possible that the timer
> > > that resets the pause_timestamp is being caught behind a bunch of
> > > messages processed by the poll loop.
> > >
> > > Could you try the attached patch? It resets the pause timestamp on
> > > receipt of the various message events that occur, to prevent this
> > > theoretical condition.
> >
> > Hi,
> >
> > Thanks for the patch. I haven't tried it yet. The problem I have is
> > that pause detection is logged in my two-node cluster. Sometimes the
> > log says the pause was more than 900 ms. That's OK, but when one node
> > prints this log, the other node gets a token loss timeout, so the
> > cluster enters GATHER mode. When a node is paused for more than
> > 900 ms, there are no mcast messages; corosync is pretty much idle
> > except for token passing. I really cannot understand why corosync is
> > not running for more than 900 ms!
> >
> > Corosync is running under the SCHED_RR real-time scheduling policy,
> > so it should get CPU. Also, the pause timer (60 ms) and the
> > retransmit timer (130 ms) are enabled in operational mode. Corosync
> > should run at least every 60 ms even if there are no mcast messages!
> > It should also wake up on every token or mcast message arrival. I
> > think corosync is stuck somewhere during this 900 ms pause, either in
> > the poll() routine or in a message processing callback. Do you have
> > any idea about this kind of pause?
> >
> > Thanks very much
> > hj
>
> The pause detection is designed to detect when the corosync process is
> not scheduled for long periods of time. I have seen situations where
> kernel drivers take spinlocks for long periods and don't release them
> (disabling scheduling in the process).

The only driver related to corosync that I can think of is the network
driver. Does this spinlock holding happen in network driver code? If so,
more specifically, does it happen in the poll() call or in
read()/write()?

Thanks
hj

> > --
> > Peakpoint Service
> >
> > Cluster Setup, Troubleshooting & Development
> > [email protected]
> > (303) 997-2823

--
Peakpoint Service

Cluster Setup, Troubleshooting & Development
[email protected]
(303) 997-2823
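
For readers following the timing argument in this thread, here is a
minimal sketch of how a pause timestamp that is refreshed only by a timer
can report a pause even though the process never stopped running. It is a
simplified model, not corosync's code: the class and function names are
hypothetical, the token/2 threshold is an assumption taken from the
"token/2" remark quoted above, and token=300 is chosen only because
token/5 then matches the 60 ms pause timer mentioned above.

```python
# Hypothetical model of the pause check discussed in this thread.
# All names and the threshold are illustrative assumptions, not corosync's API.

class PauseDetector:
    def __init__(self, token_ms, threshold_ms=None):
        # Assumed threshold: half the token timeout.
        self.threshold_ms = token_ms / 2 if threshold_ms is None else threshold_ms
        self.pause_timestamp_ms = 0.0

    def touch(self, now_ms):
        """Record that the process was known to be running at now_ms."""
        self.pause_timestamp_ms = now_ms

    def check(self, now_ms):
        """Return the apparent pause in ms if it exceeds the threshold, else 0."""
        gap = now_ms - self.pause_timestamp_ms
        return gap if gap > self.threshold_ms else 0.0


def apparent_pause(token_ms, n_messages, ms_per_message, reset_on_message):
    """Handle a burst of queued messages before any timer gets to run.

    Returns the pause reported when one more message (for example a
    membership message) is checked right after the burst.
    """
    det = PauseDetector(token_ms)
    det.touch(0.0)
    now = 0.0
    for _ in range(n_messages):
        now += ms_per_message      # time spent handling one queued message
        if reset_on_message:
            det.touch(now)         # the idea behind the proposed patch
    now += ms_per_message          # next message arrives; no timer has fired yet
    return det.check(now)


if __name__ == "__main__":
    print(apparent_pause(300, 200, 1.0, reset_on_message=False))
    print(apparent_pause(300, 200, 1.0, reset_on_message=True))
```

In this model the process is busy the whole time, yet the timer-only
variant reports a pause because the timer sat behind the message burst.
A report of this kind says nothing about a real scheduling stall such as
the spinlock case described above, which would need to be diagnosed
separately.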
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
