Hi,

On Thu, Apr 09, 2009 at 10:36:02AM +0200, Lars Ellenberg wrote:
> On Wed, Apr 08, 2009 at 10:20:58PM +0200, Julien WICQUART wrote:
> >
> >> Message: 3
> >> Date: Wed, 8 Apr 2009 17:55:34 +0200
> >> From: Dejan Muhamedagic <[email protected]>
> >> Subject: Re: [Linux-HA] Stranges "dead link" and "late heartbeat" on
> >>          sunny Sunday.
> >> To: General Linux-HA mailing list <[email protected]>
> >> Message-ID: <[email protected]>
> >> Content-Type: text/plain; charset=us-ascii
> >>
> >> Hi,
> >>
> >> On Tue, Apr 07, 2009 at 04:25:03PM +0200, julien WICQUART wrote:
> >>
> >>> Hi,
> >>>
> >>> On Sunday the 5th of April, I got a strange surprise.
> >>> 7 of our 2-node master/slave heartbeat clusters in 2
> >>> different datacenters did the same "dance" at different
> >>> times of the day.
> >>>
> >>> We use: - Different DELL servers 1850 SC1425 2950
> >>>         - GNU/Linux Debian Etch kernel 2.6.18-6-686
> >>>         - heartbeat 1.2.5-3
>                            ^^^^^^^

Well spotted Lars. And I tend to forget things :)

Thanks,

Dejan

> >>> The systems were not heavily loaded, and I have looked in old
> >>> logs: there has been no "late heartbeat" for a long time, so this
> >>> is not a problem like "How to tune Heartbeat on heavily loaded
> >>> system to avoid split-brain?".
> >>>
> >>>
> >>> The strangest part is that heartbeat says its own system is dead.
> >>> The link on eth1 is a crossover cable, so there is no physical
> >>> device between the 2 servers.
> >>>
> >>>
> >>> Here is a typical log of this problem:
> >
> >> The heartbeat was late for about 42 seconds.
> >>
> >> According to the timestamps, looks like the whole machine was
> >> "hanging". At the same time, heartbeat realized that there was no
> >> heartbeat (local and from other nodes) and that there's again
> >> heartbeat, albeit late. Looks like a hardware problem. SCSI
> >> cables? Do you see any other interesting messages in system logs?
> >> There used to be a kernel (I think on RedHat) which would
> >> sometimes forget to schedule processes.
> >>
> >> Thanks,
> >>
> >> Dejan
> >>
> >>
> >
> > Hi,
> >
> > the servers don't seem to hang. I've got log entries 1 second before
> > and after in other log files.
> >
> > The strangest part is that on one day, the 5th of April, seven of our
> > clusters did the same thing. We didn't have this kind of problem
> > before and have not had it since.
> > There are 5 clusters in one datacenter and 2 in another.
> > The only things these servers have in common are the NTP and DNS
> > servers. So I have looked at the NTP log and don't see anything
> > strange.
>
> I already posted this to the list, but apparently used the wrong
> envelope from, as it did not come through yet.
>
> this seems to be the old "times() wrap because of uptime wrap and broken
> glibc syscall wrapper on 32bit linux" bug.
>
> fixed in 2.0.8 and later.
>
> on a 250 HZ kernel, this happens every 298 days, 5 hours and 36 minutes
> (or some such).
>
> it is uptime that matters, not process start time, nor wallclock time.
>
> as you are on etch, but seemingly prefer the v1 haresources style
> config, I recommend upgrading to heartbeat 2.1.4 from backports
> and continuing to use your config as is.
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
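
To make the failure mode Lars describes a bit more concrete, here is a minimal Python sketch of a wrapping 32-bit tick counter. It is not heartbeat's code and it does not call times(); every name in it (TICKS_PER_SECOND, ERRNO_WINDOW, looks_like_error, elapsed_ticks) is made up for illustration. It assumes times() reports in units of 100 ticks per second (the usual USER_HZ on Linux) and that the syscall wrapper treats raw return values from -4095 to -1 as error codes, which is the limitation the Linux times(2) man page describes for i386.

```python
# Sketch only: models a 32-bit tick counter, it does not call times().
TICKS_PER_SECOND = 100        # assumption: USER_HZ, the unit times() reports in
MODULUS = 1 << 32             # a 32-bit clock_t wraps at 2**32
ERRNO_WINDOW = 4095           # assumption: -4095..-1 is read as "error"


def as_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MODULUS - 1
    return value - MODULUS if value >= (1 << 31) else value


def looks_like_error(raw_ticks):
    """True if a legitimate tick count falls in the range that a
    syscall wrapper would mistake for a negated errno."""
    return -ERRNO_WINDOW <= as_signed32(raw_ticks) <= -1


def elapsed_ticks(earlier, later):
    """Wrap-safe difference between two 32-bit tick readings."""
    return (later - earlier) % MODULUS


# Length of the window in which valid readings look like errors.
error_window_seconds = ERRNO_WINDOW / TICKS_PER_SECOND

# Two readings taken 200 ticks apart, straddling the wrap.
before = MODULUS - 50
after = 150

if __name__ == "__main__":
    print("naive difference:    ", after - before)
    print("wrap-safe difference:", elapsed_ticks(before, after))
    print("error window (s):    ", error_window_seconds)
```

Under those assumptions the window is 4095 / 100 = 40.95 seconds, which is in the same range as the roughly 42 seconds of late heartbeat reported earlier in the thread. The sketch also shows why only uptime matters: the counter value depends on how long the kernel has been running, not on when the process started or what the wall clock says.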
