Lars Marowsky-Bree wrote:
> On 2006-08-16T08:37:43, Alan Robertson <[EMAIL PROTECTED]> wrote:
> 
>>> What if it implies an ENOMEM, because for whatever reason, the system
>>> call required memory which wasn't available, EAGAIN because we got a
>>> signal during the system call, or ...?
>> As you well know, neither of these is an appropriate error for this
>> system call.  All this call does is copy out data from a system data
>> structure that's already being kept.  In particular, when passed NULL,
>> it just returns the value of a system integer.
> 
> That isn't entirely true. 

What isn't true?  If something I said above is untrue, then please point
out my error in detail so I can correct my misinformation.

> I _know_ that times() isn't documented to have explicit errors (just
> that it can return -1 and have errno set accordingly). Which is why it
> returning an error is, well, surprising, no?

No, it's not surprising.  It's a known bug.  Known bugs aren't
surprising.  Only unknown bugs are surprising.
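
For readers who have not seen the bug, here is a rough sketch of the kind of
work-around being argued about. It is not the patch that was committed. It
rests on one assumption: on Linux the raw tick count is the system call's
return value, and the C library mistakes counts in the range -4095..-1 for
error codes, returning -1 with errno set to the negated count. The names
recover_ticks() and times_workaround() are hypothetical.

```c
#include <errno.h>
#include <sys/times.h>

/*
 * Hypothetical helper: given what times() returned and the errno it
 * left behind, reconstruct the tick count.
 * Assumption: a spurious failure shows up as -1 with errno holding
 * the negated tick count.
 */
static clock_t recover_ticks(clock_t raw, int err)
{
    if (raw == (clock_t)-1 && err != 0) {
        return (clock_t)(-err);   /* undo the library's translation */
    }
    return raw;                   /* ordinary case: value is already right */
}

/* Hypothetical wrapper: calls times() and hides the spurious -1. */
static clock_t times_workaround(void)
{
    struct tms unused;            /* CPU-time fields are not needed here */
    int saved_errno = errno;      /* do not disturb the caller's errno */
    clock_t raw;
    clock_t ticks;

    errno = 0;                    /* so a stale errno is not misread */
    raw = times(&unused);
    ticks = recover_ticks(raw, errno);
    errno = saved_errno;
    return ticks;
}
```

A caller would use times_workaround() wherever it previously called times()
only for its return value. The cost is exactly the objection raised in this
thread: a genuine error from times(), if one ever occurred, would be turned
into a plausible-looking tick count instead of being reported.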

>>> If something we expect to NOT return an error returns one, we shouldn't
>>> fudge it up, but raise hell: This is a fail-safe, fail-fast mechanism.
>> So, it's better to have it fail on every known version of Linux.  I got
>> it.  Sorry, I can't agree with that goal.  In fact, I doubt anyone
>> agrees with that goal.
> 
> No, the goal is to fix errors where they occur. And not just one of the
> symptoms. While we're on technical arguments, I believe that this is
> sound engineering.

I haven't ever disagreed with fixing it in glibc.  Let me know when you
get it fixed in the glibc base code and all the distros have switched to
that new base version of glibc.  _Then_ it will be really fixed, and we
can take this ugly kludge out.  And we will ALL celebrate!

>>> Well, I think it's another instance of a rather fundamental issue.
>>> You're mostly concerned with fixing heartbeat so that it works across
>>> the wildest deployments, and fixing them _internally_. Regardless of
>>> what that takes.
>> Yeah, you know -- like SLES8, SLES9, SLES10, RHAS3, RHEL4, RHEL5, Debian
>> stable.  There are no known versions of Linux on which it runs correctly.
>>
>> Please also note that this is an "ad hominem" argument.  You are arguing
>> (falsely and irrelevantly) about my motivations - and in a way you
>> intend to be insulting.  You intend to discredit the patch by
>> discrediting my motivations.  Ad hominem arguments have no place on this
>> mailing list.
> 
> Sorry, it wasn't intended as such. (Really!) But, motivation and scope
> play an important role in assessing where to fix a bug.

You may not have intended it to be insulting, but it is a prima facie ad
hominem argument.

> We can argue back and forth and ignore these factors, but then we won't
> make any progress. I don't want to discredit them, but I want to point
> them out.
> 
> If your scope were to encompass more than heartbeat, and if glibc was
> under your control, where would you fix that? Right - in glibc, not in
> heartbeat, because it would fix a larger class of bugs. And, if you did
> that, you probably wouldn't implement the work-around in heartbeat as
> well, but tell them to update their glibc?
> 
> But, as it is, you're focused on making Heartbeat work - and if that
> includes an (admittedly ugly, for the bug is ugly) work-around, then so
> be it.
> 
> That's not discrediting your opinion in this regard. If your scope is
> Heartbeat, then, why, that's the obvious (and correct!) answer. 

The project's scope is by definition "heartbeat".  Since this is a
vendor-neutral project (and even OS-neutral), it is an error to assume
that we can fix everything upstream in a timely fashion.

> I just don't share it: <hat role="distributor"> I get paid to worry
> about how heartbeat fits into the entire distribution we ship, and you
> are suggesting to silently hide a bug in one of our fundamental
> libraries, by making your specific symptom "go away". Of _course_ I
> cannot agree with that.</hat>

I strongly suggested reporting it upstream.  Please try not to put words
in my mouth which are in direct contradiction to what I stated.

I understand completely your role as distributor.  I'm happy to help you
any way I can.  I am delighted for SUSE's support.  But, this matter is
not a SUSE matter, it's a project matter.  So, your role as distributor,
although much appreciated, is irrelevant in this matter.  So, take the
distributor hat off and join the project.

>>> This, I think, leads to a better quality system overall, even if it
>>> means that we don't work on certain platforms.
>> Certain platforms == "Linux".  Somehow that's not OK with me.
> 
> No, not on Linux platforms where it's broken and hasn't yet been fixed.

Which is EVERY SINGLE Linux platform as of right now.  I think our users
deserve software that doesn't suck.  And, when it's in my power to do
so, I'm going to keep giving them the best I can do to keep bugs from
causing our software to suck - regardless of whose bug it is.

You can take some other approach if you want.  As for me, I'm committed
to having this code work for the people who use it.

>>> Linux got where it is by
>>> a very _specific_ view regarding broken legacy compatibility ;-) 
>> Is SLES10 legacy?  I wonder if your management knows that ;-)
> 
> If it also has this bug (which I haven't investigated in detail, for the
> report was against SLES9, and I've filed it as such and pointed out to
> our glibc maintainer that it needs to be verified against SLES10,
> probably SLES8 too) it needs a fix, of course.
> 
> And then anything w/o the fix becomes legacy, yes ;-) We do not, for
> example, support older Service-Packs than the last released one, et
> cetera - trying to do so leads to an accumulation of work-arounds and
> "compatibility" layers which makes the overall product impossible to
> maintain.

But, that's an irrelevant argument - since all known versions are broken
and likely will be for a long time.

>>> So, in the above case, the error needs to be reported to the vendor.
>> Which Vendor?  Oh yeah...  ALL the Linux vendors.  And every other
>> platform we use glibc on.
> 
> It's called "reported to upstream" and then the versions need to trickle
> down from there, just as if it was a glibc security issue. *shrug* The
> mechanism is in place.

And, it's a slow and uncertain process.

A proper analogy is to say that certain BIOS calls are broken, so Linux
should go on and not work on any existing system because it's broken on
every system.  The approach which has made Linux successful is to work
around all the brain-deadness in the world and triumph over it.  Surely,
getting things fixed is better.  But it's always slow, and a long way
from certain.

So, let's do both.  Let's fix it, and let's work around it until it's fixed.

> If we wanted to implement fixes for all known lower library bugs which
> could potentially affect us, we'd be in a bad place indeed. That's not
> our job.

Fortunately, we aren't affected by very many such bugs.  Probably only
about a half-dozen.   We could do as you say, and ignore it inside our
code.  Of course, a half-dozen or so things would be broken - not just
this piece of code.

Meanwhile our users suffer -- again.  This particular bug we've tried to
fix at least 2 or 3 times before.  I'm REALLY tired of it.

>> I perfectly agree that it should be reported to all the Linux
>> distributions.  What is the Novell bugzilla number for this problem?
>> What are the bugzilla numbers for the other distributions?
> 
> Novell bugzilla #199677. I don't know about others; in an ideal world,
> the fix would be pushed upstream by our glibc maintainer and then
> flow back down again. (Admittedly, it's not a very urgent bug, as we
> tend to release bugfix versions more often than every 497 days ;-)

FYI: This bug is blocked from public access, and I can't access it
either, for that matter.
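
The 497-day figure quoted above can be checked with a little arithmetic.
The sketch below assumes a 32-bit tick counter advancing 100 times per
second. Neither number is stated in this thread, so both are assumptions,
and days_until_wrap() is a hypothetical name.

```c
/*
 * Days until a tick counter of the given width wraps around.
 * Both parameters are assumptions for illustration: the message
 * does not state the counter width or the tick rate.
 */
static double days_until_wrap(unsigned counter_bits, unsigned ticks_per_second)
{
    const double seconds_per_day = 86400.0;
    double total_ticks = 1.0;
    unsigned i;

    /* 2^counter_bits, computed in floating point to avoid overflow */
    for (i = 0; i < counter_bits; i++) {
        total_ticks *= 2.0;
    }
    return total_ticks / ticks_per_second / seconds_per_day;
}
```

Under those assumptions, days_until_wrap(32, 100) comes out at roughly
497.1, which matches the figure in the quoted text.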

>> But, if you want SUSE to remain pure and free of this uncleanness, feel
>> free to apply a SUSE-specific patch to the SUSE RPM to break it on your
>> SUSE Linux RPMs.
> 
> Yes, that's of course entirely true, and what I probably will be doing.

I added a configure option to disable the kludge to make disabling it
easier for you.  I trust you'll find it to your liking.

>> I have committed the patch.
> 
> Ah, thanks for letting us finish this discussion. It was good of you to
> ask, when the feedback from one of your core buddies led to all of a one
> day delay ;-)

I read your reply and replied in detail to it.  I spent an hour or two
writing the reply I sent to you.  I sent the email and committed the fix
at the same time.

I waited for your reply.  I confess I found it disappointing.


-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
