Hi Simon,

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
>
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and standby
> systems, and we haven't seen this error in a couple of years. (There was a
> longclock error back in 2007 found by my colleague Simon Graham.)

OK, so you are well acquainted with the business. Probably better than I am.

> The longclock wrap occurred within 2:45 of a reboot.
> The apparent coincidence seems to be that we were starting resources on
> a back-up node around 165 seconds after the node had been rebooted and
> heartbeat restarted. As I expect you know, somewhere between 160 and 175
> seconds after a heartbeat start, the longclock is configured to wrap.

No, I don't know, and I couldn't find it.

> The rareness of this makes me think we hit a really obscure window...

Looks like it. But it should be thoroughly investigated, though I don't
understand how it can happen if the timer is monotonically increasing.

Thanks,

Dejan

> - Simon.
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dejan Muhamedagic
> Sent: Tuesday, June 30, 2009 5:26 AM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] clock_t wrapped around causing false
> resource start failure
>
> Hi,
>
> On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> > I'm running heartbeat 2.1.4
> >
> > I'm getting a false failure on a start of my ClusterAddr resource
> > because in the same second that the resource starts, the clock_t wraps
> > around.
> > Has anyone else seen this behavior?
>
> Can't recall. And that shouldn't have happened. The time wrap is
> recognized (as the log message shows) and a wrap counter is added
> to the high bits so that the time is still greater than the
> previous timestamp.
>
> Do you have any more information about this: Was it a one-off
> occurrence? Did your system really have a long uptime? How long?
>
> Thanks,
>
> Dejan
>
> > Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing op=ClusterAddr_start_0 key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> > Jun 22 10:04:49 node0 SpineFilesystem: running
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t wrapped around (uptime).
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process (PID 17282) timed out (try 1). Killing with signal SIGTERM (15).
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM operation SharedFs_monitor_30000 (call=17, rc=0) complete
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on ocf::IPaddr2::ClusterAddr for client 14913, its parameters: ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255] CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid [17282] timed out
> > Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> > Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> > Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM operation ClusterAddr_stop_0 (call=19, rc=0) complete
> >
> >
> > Thanks
> > Simon
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
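The wrap handling Dejan describes (the wrap is detected and a wrap counter is added to the high bits, so the time stays greater than the previous timestamp) can be illustrated with a minimal C sketch. It assumes the raw reading is a 32-bit unsigned tick count, for example the value returned by times() truncated to 32 bits. The names struct wrap_clock and wrap_clock_extend are hypothetical; this is not the code in heartbeat's longclock implementation, only an instance of the general technique.

```c
#include <stdint.h>

/* Hypothetical state for extending a 32-bit tick counter to 64 bits.
 * Illustrative names only, NOT heartbeat's longclock API. */
struct wrap_clock {
    uint32_t last_raw; /* previous raw 32-bit reading */
    uint32_t wraps;    /* number of wraps detected so far */
};

/* Returns the wrap count in the high 32 bits and the raw reading in the
 * low 32 bits. The result never decreases, because any backwards step
 * in the raw value is counted as a wrap. It equals the true tick count
 * only if (a) readings are processed in the order they were taken
 * (calls serialized) and (b) less than one full 2^32-tick period passes
 * between calls. Otherwise it can jump forward by about 2^32 ticks, or
 * silently lose a whole period. */
uint64_t wrap_clock_extend(struct wrap_clock *c, uint32_t raw)
{
    if (raw < c->last_raw) {
        /* Raw counter went backwards: assume exactly one wrap. */
        c->wraps++;
    }
    c->last_raw = raw;
    return ((uint64_t)c->wraps << 32) | raw;
}
```

For example, starting from a zeroed struct, the readings 0xFFFFFFF0 and then 0x00000005 extend to 0xFFFFFFF0 and 0x100000005, so the second value is still larger, 21 ticks later. If condition (a) is broken, say a reading taken just before the wrap is processed after one taken just after it (0xFFFFFFFA after 0x00000005), the sketch counts no new wrap and returns 0x1FFFFFFFA: the value still increases, but it jumps forward by almost 2^32 ticks, which any elapsed-time check would read as an enormous interval. That is a property of this sketch, not a diagnosis of the failure in the log above.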
