Re: [Linux-HA] clock_t wrapped around causing false resourcestart failure

Dejan Muhamedagic Tue, 30 Jun 2009 10:16:09 -0700

Hi Simon,

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
> 
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and standby
> systems, and we haven't seen this error in a couple years. (There was a
> longclock error back in 2007 found by my colleague Simon Graham).


OK, so you are well acquainted with the business. Probably better
than I am.

> The longclock wrap occurred within 2:45 of a reboot. 
> The apparent coincidence seems to be that we were starting resources on
> a back-up node around 165 seconds after the node had been rebooted and
> heartbeat restarted. As I expect you know, somewhere between 160 and 175
> seconds after a heartbeat start, the longclock is configured to wrap.

No, I don't know and I couldn't find it.

> The rareness of this makes me think we hit a really obscure window... 

Looks like it. But it should be thoroughly investigated. Though I
don't understand how it can happen if the timer is monotonically
increasing.

Thanks,

Dejan

> - Simon.
> 
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dejan
> Muhamedagic
> Sent: Tuesday, June 30, 2009 5:26 AM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] clock_t wrapped around causing false
> resourcestart failure
> 
> Hi,
> 
> On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> > I'm running heartbeat 2.1.4
> > 
> > I'm getting a false failure on a start of my ClusterAddr resource
> > because in the same second that the resource starts, the clock_t wraps
> > around. 
> > Has anyone else seen this behavior?
> 
> Can't recall. And that shouldn't have happened. The time wrap is
> recognized (as the log message shows) and a wrap counter is added
> to the high bits so that the time is still greater than the
> previous timestamp.
> 
> Do you have any more information about this: Was it a one-off
> occurrence? Did your system really have a long uptime? How long?
> 
> Thanks,
> 
> Dejan
> 
> > Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> > op=ClusterAddr_start_0
> > key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> > Jun 22 10:04:49 node0 SpineFilesystem: running
> >                         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t
> > wrapped around (uptime).
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process
> > (PID 17282) timed out (try 1).  Killing with signal SIGTERM (15).
> >                         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM
> > operation SharedFs_monitor_30000 (call=17, rc=0) complete
> > Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on
> > ocf::IPaddr2::ClusterAddr for client 14913, its parameters:
> > ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255]
> > CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid
> > [17282]
> > timed out
> > Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM
> > operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> > Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> > Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> > op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> > Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM
> > operation ClusterAddr_stop_0 (call=19, rc=0) complete
> > 
> > 
> > Thanks
> > Simon 
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems