On Sun, 24 Apr 2011 01:07:45 -0400
Ken Wesson <kwess...@gmail.com> wrote:
> On Sun, Apr 24, 2011 at 12:01 AM, Mike Meyer <m...@mired.org> wrote:
> > On Sat, 23 Apr 2011 23:42:23 -0400
> > Ken Wesson <kwess...@gmail.com> wrote:
> >
> >> On Sat, Apr 23, 2011 at 11:35 PM, Mike Meyer <m...@mired.org> wrote:
> >> > On Sat, 23 Apr 2011 23:19:53 -0400
> >> > Ken Wesson <kwess...@gmail.com> wrote:
> >> >
> >> >> On Sat, Apr 23, 2011 at 8:13 PM, Mike Meyer <m...@mired.org> wrote:
> >> >> > On Sat, 23 Apr 2011 19:41:28 -0400
> >> >> > Ken Wesson <kwess...@gmail.com> wrote:
> >> >> > or you live in a universe where cosmic rays can flip bits and other
> >> >> > sources of hardware hiccups exist.
> >> >> Software crashes caused by non-software-bug-triggered memory
> >> >> corruption seem to me to be exceedingly rare, and they could as easily
> >> >> strike critical parts of the operating system as a multithreaded
> >> >> server program (and a large batch of independent C jobs will occupy
> >> >> more memory and have a correspondingly larger cross section as a
> >> >> target for such things).
> >> >> The best recourse if the server gets hit by something like that is
> >> >> going to be to reboot it.
> >> >
> >> > While it might be exceedingly rare on a per-cpu-second basis, if your
> >> > application runs 7x24 on enough cpus, you can expect to see them at
> >> > regular intervals. In which case the best recourse - if you want a
> >> > stable, robust application - is to restart the smallest set of
> >> > processes that might have been affected by the problem.
> >> In other words, all of them, since the operating system might have
> >> been affected by such a problem and if it was, everything else is
> >> probably affected too.
> > Let me guess - you're one of these people who
> Ah, I get it. You're arguing because you have some kind of *personal*
> issue, rather than for any logical reason.

Yup. Years of practical experience building and running such
systems. Logic and theory seldom survive contact with practice
unscathed.

> > Sure, a hardware glitch that affects the OS means you should reboot
> > the system.
> And assuming you can even detect that such a glitch has occurred at
> all (what if one hits the code doing the detecting, or the memory that
> it uses -- or the operating system, in a way that affects that code?)
> can you detect whether or not it hit the operating system?

If your hardware doesn't take a parity fault or correct the data most
of the time it tries to read such data, you need better hardware.
Anything that needs the corrupt data needs to be killed - which means
the user process, if the data is in user space. The kernel will be
fine, because it can't have used the bad data without triggering the
fault.

> > Of course, if it affects some user process, it may have
> > affected the OS without leaving evidence of doing so. Then again, it
> > may not have. While you could reboot everything "just in case", you
> > could also have a hardware glitch affect the OS without leaving
> > evidence in any process, so you might as well reboot even though
> > nothing is wrong "just in case."
> Obviously there's little point in rebooting absent *some* evidence
> that something is wrong.

Exactly. If the kernel isn't complaining about something, that's
exactly what you have - absence of evidence that something is wrong
with the kernel. Some user process failing for an unknown reason
doesn't provide any evidence that something is wrong with the kernel.

> Of course, some process segfaulting doesn't mean much if it's a
> typical C program. On the other hand, if you have a rock-solid JVM
> and kernel and various JVM bytecodes running, and the JVM faults,
> the likelihood of a problem like this is higher than if a random
> other program faulted -- indeed, either it's a JVM bug, an OS bug,
> or a glitch of the type being discussed, since arbitrary bytecode on
> a bug-free JVM shouldn't cause the JVM to fault. (Native methods
> complicate things somewhat though.)

Since neither the JVM bug nor the OS bug gets fixed by restarting the
system, and the process that has glitched will most likely be fixed
equally well by restarting it as by rebooting the system, there's no
evidence to indicate a problem that would be helped in any way by
rebooting the system.

> > Nah, hardware glitches are either localized, in which case restarting
> > just the address spaces that failed is sufficient (and has proven so
> > in practice for years), or they're systemic, in which case you'll have
> > failures throughout the system. It's pretty easy to tell the
> > difference between the two and deal with them appropriately.
> Easy for who? The system administrator? I thought we were considering
> automated means of recovering faulting systems here.

Easy for automated systems to deal with. "Foo is failing repeatedly,
quit restarting it and escalate the issue" and "Multiple failures on
foo, create an escalated issue" are both standard behaviors for
application monitoring tools.
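
As a rough sketch of those two behaviors (this is not any particular
monitoring tool's API; the Supervisor class, its thresholds, and the
returned action strings are all hypothetical names made up for
illustration), the decision logic could look something like this:

```python
from collections import defaultdict, deque


class Supervisor:
    """Hypothetical policy: restart one process, or escalate."""

    def __init__(self, max_restarts=3, window=60.0, host_limit=3):
        self.max_restarts = max_restarts  # restarts tolerated per process per window
        self.window = window              # sliding window length, in seconds
        self.host_limit = host_limit      # distinct failing processes that suggest a systemic fault
        self.failures = defaultdict(deque)  # (host, process) -> recent failure timestamps

    def _prune(self, times, now):
        # Forget failures that are older than the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()

    def on_failure(self, host, process, now):
        times = self.failures[(host, process)]
        times.append(now)
        self._prune(times, now)

        # Systemic case: several different processes on one host failed recently.
        failing = 0
        for (h, _), stamps in self.failures.items():
            if h == host:
                self._prune(stamps, now)
                if stamps:
                    failing += 1
        if failing >= self.host_limit:
            return "escalate-host"

        # Localized but persistent: quit restarting it and escalate.
        if len(times) > self.max_restarts:
            return "escalate-process"

        # Localized and transient: restart only the affected process.
        return "restart"
```

A caller would feed it failure events as they arrive, for example
sup.on_failure("web1", "foo", time.time()), and act on the returned
string. The thresholds here are arbitrary; real values depend on how
often the processes in question are expected to fail.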

    <mike
-- 
Mike Meyer <m...@mired.org>             http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
