On Wed, Apr 08, 2009 at 10:20:58PM +0200, Julien WICQUART wrote:
>
>> Message: 3
>> Date: Wed, 8 Apr 2009 17:55:34 +0200
>> From: Dejan Muhamedagic <[email protected]>
>> Subject: Re: [Linux-HA] Stranges "dead link" and "late heartbeat" on
>> sunny Sunday.
>> To: General Linux-HA mailing list <[email protected]>
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Hi,
>>
>> On Tue, Apr 07, 2009 at 04:25:03PM +0200, julien WICQUART wrote:
>>
>>> Hi,
>>>
>>> Sunday the 5th of April, i've got a strange surprise.
>>> 7 of ours 2 master/slave nodes heartbeat clusters on 2
>>> differents datacenters have done the same "dance" at differents
>>> times of the day.
>>>
>>> We used : - Different DELL servers 1850 SC1425 2950
>>>           - GNU/Linux Debian Etch kernel 2.6.18-6-686
>>>           - heartbeat 1.2.5-3
                          ^^^^^^^
>>> The systems were not heavily loaded and i have looked in old
>>> logs, i have no "late heartbeat" for a long time, so no problem
>>> like "How to tune Heartbeat on heavily loaded system to avoid
>>> split-brain?".
>>>
>>>
>>> The more strange is that heartbeat says it own system is dead.
>>> The link on eth1 is a crossover cable so no physical device between the 2
>>> servers.
>>>
>>>
>>> Here is a typical log of this problem :
>> The heartbeat was late for about 42 seconds.
>>
>> According to the timestamps, looks like the whole machine was
>> "hanging". At the same time, heartbeat realized that there was no
>> heartbeat (local and from other nodes) and that there's again
>> heartbeat, albeit late. Looks like a hardware problem. SCSI
>> cables? Do you see any other interesting messages in system logs?
>> There used to be a kernel (I think on RedHat) which would
>> sometimes forget to schedule processes.
>>
>> Thanks,
>>
>> Dejan
>>
>>
>
> Hi,
>
> the servers don't seem to hang. I've got log 1 second before and after
> in other log files.
>
> The more strange is that in one day, the 5th of April, seven of ours
> clusters do the same thing. We didn't have this kind of problem before
> and never more since this day.
> There is 5 clusters in 1 datacenter and 2 others in an other datacenter.
> The only thing similar to these servers are ntp servers and DNS servers.
> So i have looked at ntp log, i don't see anything strange.

I already posted this to the list, but apparently used the wrong
envelope from, as it has not come through yet.

This seems to be the old "times() wrap because of uptime wrap and
broken glibc syscall wrapper on 32bit linux" bug.
It is fixed in 2.0.8 and later.

On a 250 HZ kernel, this happens every 298 days, 5 hours and 36 minutes
(or some such).

It is uptime that matters, not process start time, nor wallclock time.
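
For anyone who wants to check the interval for their own kernel, here is a rough sketch of the arithmetic. Everything in it beyond HZ=250 is my assumption, not something established in this thread: USER_HZ of 100, a 32-bit clock_t, a jiffies counter that starts 300 seconds before its 32-bit wrap at boot, and a syscall wrapper that treats return values -4095..-1 as errors. The function names are made up for illustration.

```python
# Sketch only: estimates when the value returned by times() wraps on a
# 32-bit Linux box. All constants except HZ are assumptions (see above).

HZ = 250                   # kernel tick rate mentioned in the thread
USER_HZ = 100              # assumed clock ticks per second seen by userspace
ERROR_WINDOW_TICKS = 4095  # assumed: wrapper reads -4095..-1 as an errno


def seconds_until_times_wrap(hz=HZ, user_hz=USER_HZ):
    """Uptime in seconds at which the 32-bit times() value first wraps."""
    # Assumed boot value of the jiffies counter: -300 * HZ as unsigned 32-bit.
    initial_jiffies = (-300 * hz) % 2**32
    # times() reports jiffies rescaled to USER_HZ ticks, truncated to 32 bits.
    start_ticks = initial_jiffies * user_hz // hz
    ticks_left = 2**32 - start_ticks
    return ticks_left / user_hz


def split_uptime(seconds):
    """Break a number of seconds into (days, hours, minutes)."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return days, hours, rem // 60


days, hours, minutes = split_uptime(seconds_until_times_wrap())
print("first wrap after %d days, %d hours, %d minutes of uptime"
      % (days, hours, minutes))
print("error window before the wrap: %.2f seconds"
      % (ERROR_WINDOW_TICKS / USER_HZ))
```

Under these assumptions the first wrap comes out a bit over 298 days after boot; the exact hours and minutes shift with the assumed boot offset. The window just before the wrap, in which the return value would be misread as an error, is about 41 seconds, which would be in line with the "late for about 42 seconds" seen in the logs.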

As you are on etch, but seemingly prefer the v1 haresources style
config, I recommend upgrading to heartbeat 2.1.4 from backports and
continuing to use your config as is.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
