On Thu, Apr 09, 2009 at 04:22:14PM +0200, julien WICQUART wrote:
> >
> > Date: Thu, 9 Apr 2009 10:36:02 +0200
> > From: Lars Ellenberg <[email protected]>
> > Subject: Re: [Linux-HA] Stranges "dead link" and "late heartbeat" on
> > sunny Sunday.
> > To: [email protected]
> >
> > I already posted this to the list, but apparently used the wrong
> > envelope from, as it did not come through yet.
> >
> > this seems to be the old "times() wrap because of uptime wrap and broken
> > glibc syscall wrapper on 32bit linux" bug.
> >
> > fixed in 2.0.8 and later.
> >
> > on a 250 HZ kernel, this happens every 298 days, 5 hours and 36 minutes
> > (or some such).
> >
> > it is uptime that matters, not process start time, nor wallclock time.
> >
> > as you are on etch, but seemingly prefer the v1 haresources style
> > config, I recommend upgrading to heartbeat 2.1.4 from backports,
> > and continuing to use your config as is.
> >
> >
>
> Hi Lars,
>
> you're right.
>
> All the servers that did the "dance" this Sunday have 302 days of uptime today.
> Thanks for such a relevant answer.

btw, I just dug this up again.

a long time ago (before heartbeat 2.0.8, even!), I wrote this check
function, so I would know when a heartbeat v1 cluster would hit the
problem, and could schedule the upgrades accordingly.

maybe it is of some use to you or others still on heartbeat 1.x.
you only need to adjust the "soon" part, which sets how far ahead
the warning is printed (14 days here).

#!/bin/bash

soon=$(date "+%Y-%m-%d" -d "14 days")
now=$(date "+%Y-%m-%d")

check_hb_uptime_bug()
{
    # no heartbeat, no heartbeat bug
    test -e /etc/ha.d && test -e /etc/ha.d/ha.cf || return 0

    # bug only effective on 32bit
    [[ $(uname -m) = x86_64 ]] && return 0

    # I want to be able to say:
    # the heartbeat on _this_ box is a patched or fixed version.
    # because that is easier than checking for version numbers
    test -e /etc/ha.d/hb_uptime_bug_patched && return 0

    set -- $(uptime |
        sed -n -e 's/^.* up \([0-9]*\) days,\? \(..\):\(..\).*/\1 \2 \3/p' \
               -e 's/^.* up \([0-9]*\) days.*/\1/p'
    )

    # if sed did not match, hopefully it is because less than a day uptime,
    # not because the regex is bad.
    [[ $# = 0 ]] && return 0

    # with 250 HZ linux boxes, critical time is every
    # 298 days, 05:36:18 to 298 days, 05:37:00
    local uptime="$1 days ago ${2:-0} hours ago ${3:-0} minutes ago"
    local wrap_time="298 days 5 hours 36 minutes 18 seconds"
    local next_wrap="$uptime $wrap_time"
    local nwt i

    # find next critical date: (our server with the longest uptime
    # at the time of writing: 2003 days.
    # 10 * 300 days should be enough, though)
    for i in 1 2 3 4 5 6 7 8 9 10; do
        nwt=$(date "+%Y-%m-%d %H:%M" -d "$next_wrap")
        [[ $nwt > $now ]] && break
        next_wrap="$next_wrap $wrap_time"
    done

    # report if it will happen $soon
    if [[ $nwt > $now && $nwt < $soon ]]; then
        echo "heartbeat 32bit times wrap BUG will trigger on $nwt"
    fi
}

check_hb_uptime_bug
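
if parsing the output of uptime(1) with sed feels too fragile, the same
date arithmetic could probably be done on /proc/uptime instead. what
follows is only a rough sketch, not something I have run on a cluster:
it assumes a Linux /proc and GNU date, it assumes (as the function above
does) that the critical time recurs at every multiple of the 298 days
05:36:18 interval since boot, and the names wrap_secs,
seconds_to_next_wrap and check_hb_uptime_bug_proc are made up for
illustration. it leaves out the ha.cf, 64bit and "patched" marker checks.

```shell
#!/bin/bash
# sketch only: same idea as check_hb_uptime_bug, but using seconds from
# /proc/uptime (Linux-specific) instead of parsing uptime(1) output.

# 298 days 05:36:18, the interval quoted for 250 HZ kernels, in seconds
wrap_secs=$(( 298*86400 + 5*3600 + 36*60 + 18 ))

# seconds_to_next_wrap UPTIME_SECONDS
# prints the seconds left until the next multiple of wrap_secs
seconds_to_next_wrap()
{
    local up=$1
    echo $(( wrap_secs - up % wrap_secs ))
}

# check_hb_uptime_bug_proc [WINDOW_DAYS]
# warns if the next critical time is less than WINDOW_DAYS away (default 14)
check_hb_uptime_bug_proc()
{
    local window_days=${1:-14}
    local up left

    # first field of /proc/uptime is the uptime in seconds, with a fraction
    read -r up _ < /proc/uptime || return 0
    up=${up%%.*}

    left=$(seconds_to_next_wrap "$up")
    if (( left < window_days * 86400 )); then
        echo "heartbeat 32bit times wrap BUG will trigger on" \
            "$(date "+%Y-%m-%d %H:%M" -d "+$left seconds")"
    fi
}

check_hb_uptime_bug_proc 14
```

to warn earlier or later, pass a different number of days, for example
"check_hb_uptime_bug_proc 30".
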
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.