Hi,

On Thu, Apr 09, 2009 at 10:36:02AM +0200, Lars Ellenberg wrote:
> On Wed, Apr 08, 2009 at 10:20:58PM +0200, Julien WICQUART wrote:
> >
> >> Message: 3
> >> Date: Wed, 8 Apr 2009 17:55:34 +0200
> >> From: Dejan Muhamedagic <[email protected]>
> >> Subject: Re: [Linux-HA] Stranges "dead link" and "late heartbeat" on
> >>    sunny   Sunday.
> >> To: General Linux-HA mailing list <[email protected]>
> >> Message-ID: <[email protected]>
> >> Content-Type: text/plain; charset=us-ascii
> >>
> >> Hi,
> >>
> >> On Tue, Apr 07, 2009 at 04:25:03PM +0200, julien WICQUART wrote:
> >>   
> >>> Hi,
> >>>
> >>> On Sunday the 5th of April, I got a strange surprise.
> >>> 7 of our 2-node master/slave heartbeat clusters in 2
> >>> different datacenters did the same "dance" at different
> >>> times of the day.
> >>>
> >>> We use:
> >>> - Different DELL servers: 1850, SC1425, 2950
> >>> - GNU/Linux Debian Etch kernel 2.6.18-6-686
> >>> - heartbeat 1.2.5-3
>                 ^^^^^^^

Well spotted Lars.

And I tend to forget things :)

Thanks,

Dejan

> >>> The systems were not heavily loaded, and I have looked through old
> >>> logs: there has been no "late heartbeat" for a long time, so this is
> >>> not a problem like "How to tune Heartbeat on heavily loaded system to
> >>> avoid split-brain?".
> >>>
> >>>
> >>> Stranger still, heartbeat says its own system is dead.
> >>> The link on eth1 is a crossover cable, so there is no physical device
> >>> between the 2 servers.
> >>>
> >>>
> >>> Here is a typical log of this problem:
> 
> >> The heartbeat was late for about 42 seconds.
> >>
> >> According to the timestamps, looks like the whole machine was
> >> "hanging". At the same time, heartbeat realized that there was no
> >> heartbeat (local and from other nodes) and that there's again
> >> heartbeat, albeit late. Looks like a hardware problem. SCSI
> >> cables? Do you see any other interesting messages in system logs?
> >> There used to be a kernel (I think on RedHat) which would
> >> sometimes forget to schedule processes.
> >>
> >> Thanks,
> >>
> >> Dejan
> >>
> >>   
> >
> > Hi,
> >
> > The servers don't seem to hang. I have log entries from 1 second before
> > and after in other log files.
> >
> > Stranger still, on a single day, the 5th of April, seven of our
> > clusters did the same thing. We didn't have this kind of problem before
> > and have not had it again since that day.
> > There are 5 clusters in 1 datacenter and 2 others in another datacenter.
> > The only things these servers have in common are the NTP servers and DNS
> > servers. So I looked at the NTP log, and I don't see anything strange.
> 
> I already posted this to the list, but apparently used the wrong
> envelope from, as it did not come through yet.
> 
> this seems to be the old "times() wrap because of uptime wrap and broken
> glibc syscall wrapper on 32bit linux" bug.
> 
> fixed in 2.0.8 and later.
> 
> on a 250 HZ kernel, this happens every 298 days, 5 hours and 36 minutes
> (or some such).
> 
> it is uptime that matters, not process start time, nor wallclock time.
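
As a rough check of the figure above, here is a minimal Python sketch of the
arithmetic. It is a simplified model, not heartbeat or kernel code. It assumes
that times() reports ticks at 100 per second in a 32-bit clock_t, that the
kernel tick counter starts 300 seconds short of its own 32-bit wrap, and that
the syscall wrapper treats raw return values in [-4095, -1] as errors. The
function and constant names are made up for illustration, and the kernel's
exact tick-length rounding is ignored.

```python
from datetime import timedelta

CLOCK_T_BITS = 32    # assumed width of clock_t on 32-bit Linux
USER_HZ = 100        # assumed tick rate in which times() reports
BOOT_OFFSET_S = 300  # assumed: tick counter starts 5 minutes before its wrap
MAX_ERRNO = 4095     # assumed: raw returns in [-4095, -1] are read as errors


def uptime_at_first_wrap(kernel_hz: int) -> timedelta:
    """Uptime at which the 32-bit times() value first wraps (simplified)."""
    wrap = 1 << CLOCK_T_BITS
    # Kernel tick counter value at boot, under the offset assumption above.
    initial_jiffies = wrap - BOOT_OFFSET_S * kernel_hz
    # The same instant expressed in USER_HZ ticks, as times() would report it.
    initial_ticks = initial_jiffies * USER_HZ // kernel_hz
    remaining_ticks = (wrap - initial_ticks) % wrap
    return timedelta(seconds=remaining_ticks / USER_HZ)


def error_window() -> timedelta:
    """Time before the wrap during which the result looks like an error."""
    return timedelta(seconds=MAX_ERRNO / USER_HZ)


if __name__ == "__main__":
    print("first wrap at uptime:", uptime_at_first_wrap(250))
    print("error window before wrap:", error_window())
```

Under these assumptions a 250 HZ kernel reaches the wrap after a bit more than
298 days of uptime, and the window in which the return value can be mistaken
for an error is about 41 seconds long, which is close to the roughly 42 seconds
of "late heartbeat" discussed earlier in the thread. The exact hours and
minutes depend on the assumed start offset and rounding.
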
> 
> as you are on etch, but seemingly prefer the v1 haresources style
> config, I recommend upgrading to heartbeat 2.1.4 from backports,
> and continuing to use your config as is.
> 
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
