Hi Dejan,

The bug looks like a one-off occurrence. We run hundreds of hours of system stress tests each week, moving resources between main and standby systems, and we haven't seen this error in a couple of years. (There was a longclock error back in 2007, found by my colleague Simon Graham.)
The longclock wrap occurred within 2 minutes 45 seconds of a reboot. The apparent coincidence is that we were starting resources on a backup node around 165 seconds after the node had been rebooted and heartbeat restarted. As I expect you know, the longclock is configured to wrap somewhere between 160 and 175 seconds after a heartbeat start. The rarity of this makes me think we hit a really obscure window...

- Simon.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Dejan Muhamedagic
Sent: Tuesday, June 30, 2009 5:26 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] clock_t wrapped around causing false resourcestart failure

Hi,

On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> I'm running heartbeat 2.1.4
>
> I'm getting a false failure on a start of my ClusterAddr resource
> because in the same second that the resource starts, the clock_t wraps
> around.
> Has anyone else seen this behavior?

Can't recall. And that shouldn't have happened. The time wrap is recognized (as the log message shows) and a wrap counter is added to the high bits so that the time is still greater than the previous timestamp. Do you have any more information about this: Was it a one-off occurrence? Did your system really have a long uptime? How long?

Thanks,

Dejan

> Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_start_0 key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> Jun 22 10:04:49 node0 SpineFilesystem: running
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t
> wrapped around (uptime).
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process
> (PID 17282) timed out (try 1). Killing with signal SIGTERM (15).
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation SharedFs_monitor_30000 (call=17, rc=0) complete
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on
> ocf::IPaddr2::ClusterAddr for client 14913, its parameters:
> ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255]
> CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid [17282]
> timed out
> Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM
> operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation ClusterAddr_stop_0 (call=19, rc=0) complete
>
>
> Thanks
> Simon
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
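
For readers following the thread, the wrap handling Dejan describes (a wrap counter added to the high bits so that a later time still compares greater than an earlier one) can be sketched roughly as below. This is a minimal Python illustration, not heartbeat's time_longclock code: the 32-bit counter width, the LongClock class, and the tick values are all assumptions made up for the example.

```python
# Hypothetical sketch: extend a wrapping tick counter into a wide, monotonic value.
WRAP_BITS = 32                      # assumed width of the raw tick counter
WRAP_MODULUS = 1 << WRAP_BITS


class LongClock:
    """Keeps a wrap count and places it in the bits above the raw counter."""

    def __init__(self):
        self.wrap_count = 0
        self.last_raw = None

    def update(self, raw_ticks):
        """Return a wide timestamp for raw_ticks, counting any wrap seen."""
        raw_ticks &= WRAP_MODULUS - 1
        if self.last_raw is not None and raw_ticks < self.last_raw:
            # The raw counter went backwards, so treat it as exactly one wrap.
            self.wrap_count += 1
        self.last_raw = raw_ticks
        return (self.wrap_count << WRAP_BITS) | raw_ticks


clock = LongClock()
before = clock.update(WRAP_MODULUS - 5)   # 5 ticks before the raw counter wraps
after = clock.update(3)                   # 8 ticks later, just after the wrap

elapsed = after - before                  # difference of the wide timestamps
naive_elapsed = 3 - (WRAP_MODULUS - 5)    # difference of the raw values only

print(elapsed)
print(naive_elapsed < 0)
```

With the wrap counter in place the elapsed time across the wrap comes out as 8 ticks, while subtracting the raw values gives a negative number (or a huge one in unsigned arithmetic). A timeout check fed such a value could misjudge an operation as overdue; whether that is what happened in the log above is not established in this thread.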
