Re: [Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Alan Robertson Wed, 16 Aug 2006 07:37:48 -0700

Lars Marowsky-Bree wrote:
> On 2006-08-15T17:02:44, Alan Robertson <[EMAIL PROTECTED]> wrote:
> 
>> The kludge part itself is pretty mild.  I'll reproduce it below:
>>
>>      int     save_errno = errno;
>>      errno   = 0;
>>      ret     = times(TIMES_PARAM);
>>
>>      if (errno != 0) {
>>              ret = (clock_t) (-errno);
>>      }
>>      errno = save_errno;
>>      return (unsigned long)ret;
>>
>> This is a pretty small kludge.  And, it is VERY unlikely to break on any
>> correctly working system.
> 
> That is already a very bad kludge to me.
> 
> So, times() isn't supposed to return an errno. Surprise, it _does_.
> Should we assume that, oh, this probably is a valid return value and
> stuff it into the ret value?


If it weren't so difficult to test for, we could make it an autoconf
test.  In fact, autoconf is full of tests for broken system behavior.
And, we already have some ourselves - to deal with broken getpid calls
in glibc.

> What if it implies an ENOMEM, because for whatever reason, the system
> call required memory which wasn't available, EAGAIN because we got a
> signal during the system call, or ...?

As you well know, neither of these is an appropriate error for this
system call.  All this call does is copy out data from a system data
structure that's already being kept.  In particular, when passed NULL,
it just returns the value of a system integer.
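
For anyone who wants to try the workaround outside of heartbeat, here is
a minimal, self-contained sketch of the same idea.  The function name
ticks_since_boot() is hypothetical (it is not the name used in the
heartbeat source), and it passes a real struct tms buffer rather than
TIMES_PARAM so that it does not depend on NULL being accepted.  The
assumption, as I understand the discussion in this thread, is that a
tick count close to the 32-bit wrap looks like a negative error code to
the library wrapper, so the real value can be recovered as -errno.  At
100 ticks per second a 32-bit counter wraps after 2^32 / 100 seconds,
which is roughly 497 days.

```c
#include <errno.h>
#include <sys/times.h>

/*
 * Hypothetical helper: return the tick count from times() even when the
 * C library misreports a large tick value as an error.
 * Assumption: a nonzero errno after times() carries the negated tick
 * count, not a genuine failure.
 */
unsigned long ticks_since_boot(void)
{
    struct tms unused;          /* filled in by times(), ignored here */
    int save_errno = errno;     /* do not disturb the caller's errno */
    clock_t ret;

    errno = 0;
    ret = times(&unused);
    if (errno != 0) {
        /* Recover the tick value that was mistaken for an error code. */
        ret = (clock_t)(-errno);
    }
    errno = save_errno;
    return (unsigned long)ret;
}
```

Callers should compare two readings with unsigned subtraction
(later - earlier), so that an interval which spans the wrap still comes
out right.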

> If something we expect to NOT return an error returns one, we shouldn't
> fudge it up, but raise hell: This is a fail-safe, fail-fast mechanism.

So, it's better to have it fail on every known version of Linux.  I got
it.  Sorry, I can't agree with that goal.  In fact, I doubt anyone
agrees with that goal.

>> I await your better solution of how to make this work on thousands upon
>> thousands of Linux systems with many dozens of kernels and versions of
>> glibc with great anticipation.
> 
> Well, I think it's another instance of a rather fundamental issue.
> You're mostly concerned with fixing heartbeat so that it works across
> the wildest deployments, and fixing them _internally_. Regardless of
> what that takes.

Yeah, you know -- like SLES8, SLES9, SLES10, RHAS3, RHEL4, RHEL5, Debian
stable.  There are no known versions of Linux on which it runs correctly.

Please also note that this is an "ad hominem" argument.  You are arguing
(falsely and irrelevantly) about my motivations - and in a way you
intend to be insulting.  You intend to discredit the patch by
discrediting my motivations.  Ad hominem arguments have no place on this
mailing list.

> I think there's a class of errors we shouldn't bother with. We might
> detect them (and the above mechanism would; preferably we could check
> for this at configure time and refuse to build...), but it is not our
> job to fix them - the fix belongs into another layer. In this case: into
> glibc.  If it breaks, get the distributor to fix it, it isn't our
> problem.

I agree that there is such a class of bugs.  But a bug that breaks
things for everyone is not one that can be ignored.

> This, I think, leads to a better quality system overall, even if it
> means that we don't work on certain platforms.

Certain platforms == "Linux".  Somehow that's not OK with me.

> Linux got where it is by
> a very _specific_ view regarding broken legacy compatibility ;-) 

Is SLES10 legacy?  I wonder if your management knows that ;-)

> For example, if we fudge it up, sure, times() works for us, but it would
> still be broken for everyone else. Does that really help us? I doubt
> it. Do we want to expose the services we are managing to this? No, I
> think it is our responsibility as HA software to not do that.
> 
> So, in the above case, the error needs to be reported to the vendor.

Which Vendor?  Oh yeah...  ALL the Linux vendors.  And every other
platform we use glibc on.

> Full stop.
> 
> (As the above was reported against a 2.6.5 kernel, I'm pretty sure it's
> running SLES9...)

Wolfgang states this is a glibc bug.  It no doubt exists in every recent
version of glibc.  Glibc bugs are notoriously hard to get fixed
everywhere.  For most people, the fix will be available in 1.5 to 2
years - because the distros are extremely slow to put out non-security
patches.

You have not yet proposed a practical mechanism for dealing with the
problem.  Reporting it to the complete set of Linux vendors is fine.
It's an excellent thing to do.  It just doesn't solve even one instance
of the problem in a predictable timeframe.

I perfectly agree that it should be reported to all the Linux
distributions.  What is the Novell bugzilla number for this problem?
What are the bugzilla numbers for the other distributions?

Our bug number is 1407.

But, if you want SUSE to remain pure and free of this uncleanness, feel
free to apply a SUSE-specific patch to the SUSE RPM to break it on your
SUSE Linux RPMs.

I have committed the patch.


-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
