Re: [Linux-ha-dev] heartbeat fails every 497 days on 32 bit linux

Lars Marowsky-Bree Wed, 16 Aug 2006 09:06:33 -0700

On 2006-08-16T08:37:43, Alan Robertson <[EMAIL PROTECTED]> wrote:

> > What if it implies an ENOMEM, because for whatever reason, the system
> > call required memory which wasn't available, EAGAIN because we got a
> > signal during the system call, or ...?
> 
> As you well know, neither of these are appropriate errors for this
> system call.  All this call does is copy out data from a system data
> structure that's already being kept.  In particular, when passed NULL,
> it just returns the value of a system integer.


That isn't entirely true. 

I _know_ that times() isn't documented to have explicit errors (just
that it can return -1 and have errno set accordingly). Which is why it
returning an error is, well, surprising, no?

> > If something we expect to NOT return an error returns one, we shouldn't
> > fudge it up, but raise hell: This is a fail-safe, fail-fast mechanism.
> So, it's better to have it fail on every known version of Linux.  I got
> it.  Sorry, I can't agree with that goal.  In fact, I doubt anyone
> agrees with that goal.

No, the goal is to fix errors where they occur. And not just one of the
symptoms. While we're on technical arguments, I believe that this is
sound engineering.
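
A minimal sketch of the fail-fast handling argued for above. The wrapper
name checked_times() is hypothetical; it is neither the committed patch nor
existing heartbeat code. It only illustrates "raise hell instead of fudging
the value":

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>

/* Hypothetical wrapper (an assumption, not heartbeat's code): return the
 * tick count from times(), or abort if the call reports an error. */
static clock_t checked_times(void)
{
    struct tms unused;  /* a real buffer; passing NULL is a Linux extension */
    clock_t ticks;

    errno = 0;
    ticks = times(&unused);
    if (ticks == (clock_t)-1) {
        /* Fail fast: report the error and stop rather than guess a value. */
        fprintf(stderr, "times() failed: %s\n", strerror(errno));
        abort();
    }
    return ticks;
}
```

A caller would use checked_times() wherever it would otherwise call times()
directly and trust the result.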

> > Well, I think it's another instance of a rather fundamental issue.
> > You're mostly concerned with fixing heartbeat so that it works across
> > the wildest deployments, and fixing them _internally_. Regardless of
> > what that takes.
> 
> Yeah, you know -- like SLES8, SLES9, SLES10, RHAS3, RHEL4, RHEL5, Debian
> stable.  There are no known versions of Linux on which it runs correctly.
> 
> Please also note that this is an "ad hominem" argument.  You are arguing
> (falsely and irrelevantly) about my motivations - and in a way you
> intend to be insulting.  You intend to discredit the patch by
> discrediting my motivations.  Ad hominem arguments have no place on this
> mailing list.

Sorry, it wasn't intended as such. (Really!) But motivation and scope
play an important role in assessing where to fix a bug.

We can argue back and forth and ignore these factors, but then we won't
make any progress. I don't want to discredit them, but I want to point
them out.

If your scope were to encompass more than heartbeat, and if glibc were
under your control, where would you fix that? Right - in glibc, not in
heartbeat, because it would fix a larger class of bugs. And, if you did
that, you probably wouldn't implement the work-around in heartbeat as
well, but tell them to update their glibc?

But, as it is, you're focused on making Heartbeat work - and if that
includes an (admittedly ugly, for the bug is ugly) work-around, then so
be it.

That's not discrediting your opinion in this regard. If your scope is
Heartbeat, then, why, that's the obvious (and correct!) answer.

I just don't share it: <hat role="distributor"> I get paid to worry
about how heartbeat fits into the entire distribution we ship, and you
are suggesting to silently hide a bug in one of our fundamental
libraries, by making your specific symptom "go away". Of _course_ I
cannot agree with that.</hat>

> > This, I think, leads to a better quality system overall, even if it
> > means that we don't work on certain platforms.
> Certain platforms == "Linux".  Somehow that's not OK with me.

No, not on Linux platforms where it's broken and hasn't yet been fixed.

> > Linux got where it is by
> > a very _specific_ view regarding broken legacy compatibility ;-) 
> Is SLES10 legacy?  I wonder if your management knows that ;-)

If it also has this bug (which I haven't investigated in detail, for the
report was against SLES9, and I've filed it as such and pointed out to
our glibc maintainer that it needs to be verified against SLES10,
probably SLES8 too) it needs a fix, of course.

And then anything w/o the fix becomes legacy, yes ;-) We do not, for
example, support older Service-Packs than the last released one, et
cetera - trying to do so leads to an accumulation of work-arounds and
"compatibility" layers which makes the overall product impossible to
maintain.

> > So, in the above case, the error needs to be reported to the vendor.
> Which Vendor?  Oh yeah...  ALL the Linux vendors.  And every other
> platform we use glibc on.

It's called "reported to upstream" and then the versions need to trickle
down from there, just as if it was a glibc security issue. *shrug* The
mechanism is in place.

If we wanted to implement fixes for all known bugs in lower-level
libraries which could potentially affect us, we'd be in a bad place
indeed. That's not our job.

> I perfectly agree that it should be reported to all the Linux
> distributions.  What is the Novell bugzilla number for this problem?
> What are the bugzilla numbers for the other distributions?

Novell bugzilla #199677. I don't know about others; in an ideal world,
the fix would be pushed upstream by our glibc maintainer and then
flow back down again. (Admittedly, it's not a very urgent bug, as we
tend to release bugfix versions more often than every 497 days ;-)
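
For reference, the 497-day figure is consistent with a 32-bit tick counter
advancing at 100 ticks per second. The tick rate is an assumption here (it
is not stated in this thread), and the helper name days_until_wrap() is
made up for illustration:

```c
#include <stdint.h>

/* Days until an unsigned tick counter that is `bits` wide wraps around,
 * when it advances at `hz` ticks per second. Valid for bits < 64. */
static double days_until_wrap(unsigned bits, double hz)
{
    double ticks = (double)(UINT64_C(1) << bits);  /* 2^bits distinct values */

    return ticks / hz / 86400.0;                   /* 86400 seconds per day */
}
```

With bits = 32 and hz = 100.0 this comes to roughly 497.1 days.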

> But, if you want SUSE to remain pure and free of this uncleanness, feel
> free to apply a SUSE-specific patch to the SUSE RPM to break it on your
> SUSE Linux RPMs.

Yes, that's of course entirely true, and what I probably will be doing.

> I have committed the patch.

Ah, thanks for letting us finish this discussion. It was good of you to
ask, when the feedback from one of your core buddies led to all of a one
day delay ;-)

(This last paragraph _is_ an ad hominem attack, in case there was any
doubt, I just haven't made up my mind whether it was one from you to me
or vice-versa ;-)


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
