On Mon, Apr 22, 2002 at 04:26:39PM -0500, Holly, Jason wrote:
> has anyone established mean-time-between-failure numbers for linux instances
> running under vm? anything general would be good information. i'm curious
> about disk, memory or other system failures that compromise the vm
> instances.

The record for us is about 9 months for a single Linux image. Average is about 3-4 months between reboots, depending on what's running in them -- things that suck up lots of memory like WebSphere tend to shorten the lifespan of the machine by fragmenting storage. Machines that get a lot of interactive use tend to collect a few zombies over time, so periodic reboots become a reasonably good idea.

For the VM side, I know of sites with uptimes measured in years; my personal record is 670 days w/o IPL (stopped by some idjit doing something stupid with a tape drive). With a bit of planning, you could probably exceed that easily -- use CSE and clustered machines and a few other tricks.

Disks and hung tape drives are the biggest risk; memory and CPUs are pretty much hot-pluggable these days, and there is tons of automated sparing going on, so things in the box are seldom the failure point. For disk, most of it is RAIDed, so you can replace any single failed spindle before you lose the volume from the 390 perspective.

When we have multicast support for guest LANs and can do clustering more effectively, I think taking a single instance out will have less impact on service availability. We'll see.

-- db
