On Sun, Jan 03, 2010 at 09:55:52PM -0800, Ryan Allen wrote:
> I am quite interested in this statistic.  Where does this 50% come from?

A dozen or so personal experiences over the last few years. 

> We have all seen the math.  If one drive has a 5% chance of failure in
> one year, the chances of two drives failing --at the same time-- are
> multiplicative (.05^2), or .25%.  Of course having more than two disks
> increases the chances that multiple drives could fail at the same time.

Drives silently fail in ways that don't get noticed by the RAID controller
until you try a rebuild. This has happened to me fairly often. Also, drives
do not always fail in statistically-predictable ways. God help you if all
the drives in your array are the same make and model and they all start
failing at once.
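
To see why that matters for the math above, here's a rough sketch in C with
made-up numbers (purely illustrative, not measured rates): the 0.25% figure
only holds if the two failures are independent, which drives from the same
batch are not.

    /* Illustrative only: hypothetical failure rates, not real statistics. */
    #include <stdio.h>

    int main(void)
    {
        double p_year = 0.05;  /* assumed chance one drive fails in a year */

        /* Independent drives: both fail together with probability p^2. */
        printf("independent: %.2f%%\n", p_year * p_year * 100.0); /* 0.25% */

        /* Same make/model/batch: suppose (hypothetically) that once one
         * drive dies, its twin has a 50% chance of being near death too. */
        printf("correlated:  %.2f%%\n", p_year * 0.50 * 100.0);   /* 2.50% */

        return 0;
    }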

> I'm under the firm belief that anybody can have a solid, secure, and
> more affordable RAID 5 setup if 1) you have a decently small number of
> disks, say <= 6, 2) the user replaces any failed disks (SMART, or
> other failure indicators) as quickly as they can, and 3) you have a solid
> backup plan in place.  This will cut hardware budget by around 40%, and
> consume less power.

If your backups are good, *and* you are that proactive, *and* you can tell
users not to write data to a server where it may be lost before you initiate
a rebuild, *and* you can wait for backups to restore without much heartburn
if the rebuild does fail, then sure, it can work.

> Can you enlighten me on the importance of "battery backed RAID cards"
> when the entire system is on a massive UPS, programmed to do a clean
> shutdown on power failure?  The system in mind has been tuned to do a
> worst case shutdown in just under 1/3 the measured battery life.  

Guaranteed data coherency on system failure, speed, cost: pick any two.

Filesystems depend on data being written in a particular order. Mail servers
and database servers depend on data being flushed out to disk when that
fsync() call returns. Waiting for the writes to happen means, well, waiting
a lot. By having the RAID controller store those writes in battery-backed
cache, the controller can report back to the OS saying that the writes have
succeeded when they haven't actually been done yet. This allows you to do a
lot more I/Os per second and dramatically reduces the latency for each
request (i.e. the delay the users see). In addition, the write cache gets
properly replayed on the next powerup (assuming it's within a few days) even
if the OS crashes or the UPS fails or a cord gets unplugged or the power
strip's circuit breaker trips or the shutdown hangs or... (believe me, there
are an awful lot of "or"s.)
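
To make the fsync() point concrete, here's a minimal sketch (hypothetical
helper name, error handling trimmed) of the write-then-fsync pattern mail
servers and databases rely on. With write-through to the platters, that
fsync() is where all the waiting happens; a battery-backed cache is what lets
the controller acknowledge it early without breaking the durability promise.

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: append a record and only return success once
     * the data has been acknowledged as durable. */
    int durable_append(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        /* Without battery-backed cache, fsync() blocks until the disk has
         * the data; with it, the controller can answer from its own RAM. */
        return close(fd);
    }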
-- 
Robert Woodcock - r...@blarg.net
"Anybody else wanna negotiate?" -- The Fifth Element
