Re: [Linux-HA] clock_t wrapped around causing false resourcestart failure

Tavanyar, Simon Tue, 30 Jun 2009 07:04:12 -0700

Hi Dejan,

The bug looks like a one-off occurrence. We run hundreds of hours of
system stress tests in a week, moving resources between main and standby
systems, and we haven't seen this error in a couple of years. (There was a
longclock error back in 2007, found by my colleague Simon Graham.)


The longclock wrap occurred within 2:45 of a reboot.
The apparent coincidence seems to be that we were starting resources on
a backup node around 165 seconds after the node had been rebooted and
heartbeat restarted. As I expect you know, somewhere between 160 and 175
seconds after a heartbeat start, the longclock is configured to wrap.
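
To illustrate why a wrap landing in the same second as a resource start
could look like a timeout, here is a minimal Python sketch. It is not the
heartbeat or lrmd code, and the thread does not establish the actual cause.
The 32-bit tick width, HZ = 100, and both function names are assumptions
made for the illustration; only the 20000 ms timeout comes from the log.

```python
TICK_BITS = 32                      # assumed width of the raw tick counter
TICK_MASK = (1 << TICK_BITS) - 1
HZ = 100                            # assumed ticks per second
TIMEOUT_MS = 20000                  # CRM_meta_timeout from the quoted log


def elapsed_ms_modular(start_raw, now_raw):
    """Elapsed ms between two raw readings; stays correct across one wrap."""
    ticks = (now_raw - start_raw) & TICK_MASK
    return ticks * 1000 // HZ


def elapsed_ms_naive_64(start_raw, now_raw):
    """Same subtraction in an unsigned 64-bit width; breaks across a wrap."""
    ticks = (now_raw - start_raw) & ((1 << 64) - 1)
    return ticks * 1000 // HZ


# Hypothetical readings: the op starts 50 ticks before the counter wraps
# and is checked 50 ticks after it, so one second has really passed.
start = TICK_MASK - 49
now = 50

print(elapsed_ms_modular(start, now))                # 1000
print(elapsed_ms_naive_64(start, now) > TIMEOUT_MS)  # True: false timeout
```

With the wider subtraction the small negative difference turns into an
enormous positive one, so an operation that has run for one second appears
to have exceeded its 20 second limit.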

The rareness of this makes me think we hit a really obscure window... 

- Simon.


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Dejan
Muhamedagic
Sent: Tuesday, June 30, 2009 5:26 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] clock_t wrapped around causing false
resourcestart failure

Hi,

On Mon, Jun 29, 2009 at 02:53:33PM -0400, Tavanyar, Simon wrote:
> I'm running heartbeat 2.1.4
> 
> I'm getting a false failure on a start of my ClusterAddr resource
> because in the same second that the resource starts, the clock_t wraps
> around. 
> Has anyone else seen this behavior?

Can't recall. And that shouldn't have happened. The time wrap is
recognized (as the log message shows) and a wrap counter is added
to the high bits so that the time is still greater than the
previous timestamp.
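
The wrap handling described above can be sketched roughly as follows. This
is an illustration in Python, not the actual time_longclock() code: the
class name LongClock, the update() method, and the 32-bit counter width are
all assumptions.

```python
WRAP_BITS = 32                      # assumed width of the raw tick counter
WRAP_MASK = (1 << WRAP_BITS) - 1


class LongClock:
    """Extend a wrapping tick counter into an ever-increasing value."""

    def __init__(self):
        self.last_raw = 0           # most recent raw reading
        self.wraps = 0              # number of wraps seen so far

    def update(self, raw_ticks):
        raw = raw_ticks & WRAP_MASK
        if raw < self.last_raw:
            # The raw counter went backwards, so it must have wrapped.
            self.wraps += 1
        self.last_raw = raw
        # The wrap counter goes into the high bits, above the raw value.
        return (self.wraps << WRAP_BITS) | raw


clock = LongClock()
before = clock.update(WRAP_MASK - 5)   # just before the wrap
after = clock.update(10)               # just after the wrap
assert after > before
print(after - before)                  # 16
```

Note that a scheme like this only works if the clock is sampled at least
once per wrap period and every timestamp being compared goes through the
same extended clock.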

Do you have any more information about this: Was it a one-off
occurrence? Did your system really have a long uptime? How long?

Thanks,

Dejan

> Jun 22 10:04:49 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_start_0
> key=8:14:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:49 node0 lrmd: [14910]: info: rsc:ClusterAddr: start
> Jun 22 10:04:49 node0 SpineFilesystem: running
>                         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 lrmd: [14910]: info: time_longclock: clock_t
> wrapped around (uptime).
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: ClusterAddr:start process
> (PID 17282) timed out (try 1).  Killing with signal SIGTERM (15).
>                         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> Jun 22 10:04:49 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation SharedFs_monitor_30000 (call=17, rc=0) complete
> Jun 22 10:04:49 node0 lrmd: [14910]: WARN: operation start[18] on
> ocf::IPaddr2::ClusterAddr for client 14913, its parameters:
> ip=[134.111.29.140] cidr_netmask=[21] broadcast=[134.111.31.255]
> CRM_meta_timeout=[20000] crm_feature_set=[2.0] nic=[biz0] : pid
> [17282]
> timed out
> Jun 22 10:04:49 node0 crmd: [14913]: ERROR: process_lrm_event: LRM
> operation ClusterAddr_start_0 (18) Timed Out (timeout=20000ms)
> Jun 22 10:04:50 node0 lrmd: [14910]: info: rsc:ClusterAddr: stop
> Jun 22 10:04:50 node0 crmd: [14913]: info: do_lrm_rsc_op: Performing
> op=ClusterAddr_stop_0 key=2:15:0:59f9d23b-effd-4ec4-a766-17ed34a92b34)
> Jun 22 10:04:50 node0 crmd: [14913]: info: process_lrm_event: LRM
> operation ClusterAddr_stop_0 (call=19, rc=0) complete
> 
> 
> Thanks
> Simon 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems