Re: MTBF

David Boyes Mon, 22 Apr 2002 18:28:57 -0700

On Mon, Apr 22, 2002 at 04:26:39PM -0500, Holly, Jason wrote:
> has anyone established mean-time-between-failure numbers for linux instances
> running under vm?  anything general would be good information.  i'm curious
> about disk, memory or other system failures that compromise the vm
> instances.


Our record is about 9 months for a single Linux image. The average
is about 3-4 months between reboots, depending on what's running in
the image -- things that suck up lots of memory, like WebSphere, tend
to shorten the lifespan of the machine by fragmenting storage. Machines
that get a lot of interactive use tend to collect a few zombie
processes over time, so periodic reboots become a reasonably good idea.

For the VM side, I know of sites with uptimes measured in years; my personal
record is 670 days w/o IPL (stopped by some idjit doing something
stupid with a tape drive). With a bit of planning, you could probably
easily exceed that -- use CSE and clustered machines and a few other tricks.
Disks and hung tape drives are the biggest risk; memory and CPUs are
pretty much hot-pluggable these days and there is a ton of automated
sparing going on, so things in the box are seldom the failure point.
For disk, most of it is RAIDed, so you can replace any single failed
spindle before you lose the volume from the 390 perspective.

When we have multicast support for guest LANs and can do clustering
more effectively, I think we'll see less impact on the service
availability from taking a single instance out. We'll see.

-- db
