On Sun, Apr 24, 2011 at 3:25 AM, Mike Meyer <m...@mired.org> wrote:
[massive snip]
>> Ah, I get it. You're arguing because you have some kind of *personal*
>> issue, rather than for any logical reason.
>
> Yup.

Well, then.

>> > Sure, a hardware glitch that affects the OS means you should reboot
>> > the system.
>> And assuming you can even detect that such a glitch has occurred at
>> all (what if one hits the code doing the detecting, or the memory that
>> it uses -- or the operating system, in a way that affects that code?)
>> can you detect whether or not it hit the operating system?
>
> If your hardware doesn't take a parity fault or correct the data most
> of the time it tries to read such data, you need better hardware.

We're discussing (only) whatever gets through such countermeasures, obviously.

> Anything that needs the data that's corrupt needs to be killed - which
> means that user process if the data is in user space.

But you have no way of knowing, in general, where the affected data
is, unless errors are only ever single-bit and the whole system is
full of parity checks from the bottom on up (slow!). Even then, it's
only by ASSUMING there's never more than one error in a given short
period of time, and ASSUMING the error-detection isn't ever itself
hit, that you can have some sort of bus-error signal halt the
affected program. And even then, if it hits a critical OS component,
boom! Blue screen time.
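
To make that mechanism concrete, here's a minimal sketch, assuming a
Linux kernel that reports an uncorrectable memory error in a page a
process touches as SIGBUS with si_code BUS_MCEERR_AR; the handler name
and the choice to simply exit are my illustration, not something any
particular system mandates:

/* Sketch: what "the bus-error signal halts the affected program" looks
 * like from userland on Linux, assuming the kernel delivers detected
 * hardware memory errors as SIGBUS with si_code BUS_MCEERR_AR. */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx; (void)info;
#ifdef BUS_MCEERR_AR
    if (info->si_code == BUS_MCEERR_AR) {
        /* info->si_addr points at the poisoned page we tried to use. */
        static const char msg[] = "uncorrectable memory error, exiting\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);  /* async-signal-safe */
    }
#endif
    _exit(EXIT_FAILURE);  /* don't keep running on possibly corrupt data */
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    /* ... normal work; an error in a page this process reads kills just
       this process, while an error in kernel memory is the "boom, blue
       screen" case above. */
    return 0;
}

Note how both ASSUMINGs show up here implicitly: the handler only helps
if the error is detected at all, and if the detection machinery (and
the memory holding this very handler) wasn't what got hit.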

> The kernel will be fine, because it can't have used the bad data without
> triggering the fault.

Unless of course the fault-detection stuff is part of what gets
affected. Then all bets are off.

>> Obviously there's little point in rebooting absent *some* evidence
>> that something is wrong.
>
> Exactly. If the kernel isn't complaining about something, that's
> exactly what you have - absence of evidence that something is wrong
> with the kernel. Some user process failing for an unknown reason
> doesn't provide any evidence that something is wrong with the kernel.

As I said, depending on the user process and how rock-solid it is when
run on ideal, zero-fault hardware (e.g., a mathematical Turing
machine), a fault in it may be evidence of a hardware failure that
might have affected more of the system, or of one that hit the OS in
such a way as to make it "blame" a particular userland process (for
instance, by generating a false positive in a fault detector -- well,
a true positive, since there really is a fault, but one that hits the
very bits used to record the fault's location, perhaps, or something
of the sort).

> Since neither the JVM bug nor the OS bug get fixed by restarting the
> system, and the process that has glitched will most likely be fixed
> equally well by restarting it as by rebooting the system, there's no
> evidence to indicate a problem that would be helped in any way by
> rebooting the system.

You're ignoring the third possibility, a hardware glitch, which was
the whole reason for this digression to begin with. That may be
fixable only by restarting the system. And in the event it's an OS
bug, the bug itself may not be fixed, but its consequences will be,
temporarily, given that the bug apparently lets the OS run normally
for a while before tripping over its own shoelaces. Restart it and it
will run some more before eventually tripping again, and progress can
still be made, though with annoying interruptions now and again.

>> > Nah, hardware glitches are either localized, in which case restarting
>> > just the address spaces that failed is sufficient (and has proven so
>> > in practice for years), or they're systemic, in which case you'll have
>> > failures throughout the system. It's pretty easy to tell the
>> > difference between the two and deal with them appropriately.
>> Easy for who? The system administrator? I thought we were considering
>> automated means of recovering faulting systems here.
>
> Easy for automated systems to deal with. "Foo is failing repeatedly,
> quit restarting it and escalate the issue" and "Multiple failures on
> foo, create an escalated issue" are both standard behaviors for
> application monitoring tools.

That's a heuristic approach. But it assumes a lower-level fault (OS or
hardware) would keep striking the same userland process. In all
likelihood, a fault affecting low-level parts of the system that
manifested as crashes in userland processes would hit many such
processes, with long intervals between hitting any particular one
twice. So you wouldn't get "foo is failing repeatedly"; you'd just get
"there's been a higher-than-typical rate of segfaults system-wide
lately", and I don't know whether that signal would rise above the
noise on a typical Unix box populated with C programs full of dodgy
pointer arithmetic. :)
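
For what it's worth, the quoted "restart it, escalate if it keeps
failing" policy is roughly the sketch below (my illustration, not any
particular monitoring tool; MAX_RESTARTS and escalate() are made-up
names). Note that it keys everything on a single watched program,
which is exactly why a fault smeared across many different processes
never trips it:

/* Sketch of a per-process restart-then-escalate supervisor. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_RESTARTS 3   /* made-up threshold for illustration */

static void escalate(const char *prog, int status)
{
    /* A real tool would open a ticket or page someone; we just log. */
    fprintf(stderr, "%s failed %d times (last wait status %d); escalating\n",
            prog, MAX_RESTARTS, status);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 2;
    }

    int failures = 0, status = 0;
    while (failures < MAX_RESTARTS) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: run the watched program */
            execvp(argv[1], &argv[1]);
            _exit(127);                 /* exec itself failed */
        }
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return 0;                   /* clean exit: nothing to restart */
        failures++;                     /* crash or error exit: try again */
    }

    escalate(argv[1], status);          /* repeated failure: stop restarting */
    return 1;
}

Catching the smeared-out case would mean watching the crash rate
across the whole box instead of per program, which is exactly the
noisy signal I'm doubtful about.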
