Re: [Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Lars Marowsky-Bree Wed, 16 Aug 2006 03:54:30 -0700

On 2006-08-15T17:02:44, Alan Robertson <[EMAIL PROTECTED]> wrote:

> The kludge part itself is pretty mild.  I'll reproduce it below:
> 
>       int     save_errno = errno;
>       errno   = 0;
>       ret     = times(TIMES_PARAM);
> 
>       if (errno != 0) {
>               ret = (clock_t) (-errno);
>       }
>       errno = save_errno;
>       return (unsigned long)ret;
> 
> This is a pretty small kludge.  And, it is VERY unlikely to break on any
> correctly working system.


That is already a very bad kludge to me.

So, times() isn't supposed to return an errno. Surprise, it _does_.
Should we assume that, oh, this probably is a valid return value and
stuff it into the ret value?

What if it implies an ENOMEM, because for whatever reason, the system
call required memory which wasn't available, EINTR because we got a
signal during the system call, or ...?

If something we expect to NOT return an error returns one, we shouldn't
fudge it up, but raise hell: This is a fail-safe, fail-fast mechanism.
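
To make the fail-fast alternative concrete, here is a rough sketch. It is
not the heartbeat code: the names times_result_is_error() and
checked_times() are made up for illustration, and whether to abort() or
to log and exit cleanly is a policy choice I'm only assuming here.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>

/* Hypothetical helper: decide whether a times() result is an error.
 * times() reports failure by returning (clock_t)-1 and setting errno. */
int times_result_is_error(clock_t ret, int err)
{
    return ret == (clock_t)-1 && err != 0;
}

/* Hypothetical wrapper: instead of folding -errno back into the return
 * value, treat an unexpected error as fatal and stop right away. */
clock_t checked_times(void)
{
    struct tms buf;
    clock_t ret;

    errno = 0;                  /* so a stale errno is not misread */
    ret = times(&buf);
    if (times_result_is_error(ret, errno)) {
        fprintf(stderr, "times() failed: %s\n", strerror(errno));
        abort();                /* assumed policy: fail fast */
    }
    return ret;
}
```

The point is only that the error path is visible and loud, rather than
being converted into something that looks like a valid tick count.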

> I await your better solution of how to make this work on thousands upon
> thousands of Linux systems with many dozens of kernels and versions of
> glibc with great anticipation.

Well, I think it's another instance of a rather fundamental issue.

You're mostly concerned with making heartbeat work across the wildest
deployments, and with fixing their problems _internally_, regardless of
what that takes.

I think there's a class of errors we shouldn't bother with. We might
detect them (and the above mechanism would; preferably we could check
for this at configure time and refuse to build...), but it is not our
job to fix them - the fix belongs in another layer. In this case: in
glibc.  If it breaks, get the distributor to fix it; it isn't our
problem.
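
One possible shape for a configure-time check, as a sketch only: it does
not detect the broken times() itself, it merely estimates how soon
clock_t wraps on the build machine, which is a blunt proxy. The helper
names wrap_period_days() and clock_t_wrap_days() are hypothetical, and
what wrap period counts as "too short" is left open.

```c
#include <limits.h>
#include <time.h>      /* clock_t */
#include <unistd.h>    /* sysconf, _SC_CLK_TCK */

/* Hypothetical helper: days until a tick counter that is `bits` wide
 * and advances ticks_per_sec times per second wraps around. */
double wrap_period_days(unsigned bits, long ticks_per_sec)
{
    double ticks = 1.0;
    unsigned i;

    for (i = 0; i < bits; i++)
        ticks *= 2.0;           /* 2^bits without needing libm */
    return ticks / (double)ticks_per_sec / 86400.0;
}

/* Hypothetical probe: wrap period of clock_t on this machine,
 * or -1.0 if the tick rate cannot be determined. */
double clock_t_wrap_days(void)
{
    long hz = sysconf(_SC_CLK_TCK);

    if (hz <= 0)
        return -1.0;
    return wrap_period_days((unsigned)(sizeof(clock_t) * CHAR_BIT), hz);
}
```

With a 32-bit counter at 100 ticks per second this gives 2^32 / 100 /
86400, or roughly 497.1 days, which is where the number in the subject
line comes from.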

This, I think, leads to a better quality system overall, even if it
means that we don't work on certain platforms. Linux got where it is by
a very _specific_ view regarding broken legacy compatibility ;-) 

For example, if we fudge it up, sure, times() works for us, but it would
still be broken for everyone else. Does that really help us? I doubt
it. Do we want to expose the services we are managing to this? No, I
think it is our responsibility as HA software to not do that.

So, in the above case, the error needs to be reported to the vendor.
Full stop.

(As the above was reported against a 2.6.5 kernel, I'm pretty sure it's
running SLES9...)


Sincerely,
    Lars

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
                                                 -- Charles Darwin

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
