On Friday November 18, [EMAIL PROTECTED] wrote:
> 
> So, I continue to believe silent corruption is mythical. I'm still open
> to a good explanation that it's not, though.
> 

Silent corruption is not mythical, though it is probably talked about
more than it actually happens (but then as it is silent, I cannot be
certain :-).

Silent corruption can happen only if an unclean degraded array is
started. 
md will not start an unclean degraded (raid 4/5/6) array (though I'm
going to add a module parameter to allow it), and mdadm will only start
such an array if given --force (in which case it modifies the array to
appear clean so that md will start it).

If your array is not degraded, or you always shut down cleanly, there
is no opportunity for raid5-level corruption (of course, the drives may
choose to corrupt things silently themselves...).

Note that an unclean degraded start doesn't imply corruption - you
could be in this situation and not have any corruption at all.  But it
does allow it.  It must, as 'unclean' means you cannot trust the
parity, and 'degraded' means that you have to.
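To make that concrete, here is a minimal Python sketch (made-up
one-byte blocks; real stripes are whole chunks) of how stale parity
plus a missing disk corrupts a block that was never even being written:

    from functools import reduce

    def xor_blocks(*blocks):
        """XOR corresponding bytes of equal-length blocks (raid5 parity)."""
        return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

    # A consistent stripe: three data blocks and their parity.
    d0, d1, d2 = b"\x11", b"\x22", b"\x33"
    parity = xor_blocks(d0, d1, d2)        # 0x11 ^ 0x22 ^ 0x33 = 0x00

    # A write to d2 is in flight when the machine crashes: the new d2
    # reaches its disk but the matching parity update does not ('unclean').
    d2_on_disk = b"\x44"
    stale_parity = parity                  # still describes the old d2

    # The disk holding d0 is also gone ('degraded'), so d0 has to be
    # reconstructed from the surviving blocks and the stale parity.
    reconstructed_d0 = xor_blocks(d1, d2_on_disk, stale_parity)

    print(reconstructed_d0 == d0)          # False: d0 comes back wrong,
                                           # though d0 was never being written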

There are two solutions to this silent corruption problem (other than
'ignore it and hope it doesn't bite', which is a fairly widely used
solution, and I haven't seen any bite marks myself).

One is journalling, as has been mentioned.  This could be done to a
mirrored pair, or to an ECC NVRAM card (the latter probably being the
best, though also the most expensive).  You would write each data block as
it becomes available, and each parity block just before commencing a
write to the raid5.  Obviously you also keep track of what you have
written.
I have toyed with the idea of implementing this, but I think demand is
sufficiently low that it isn't worth it.
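A rough sketch of the write ordering such a journal would impose (the
names here are hypothetical, not an existing md interface):

    # Hypothetical sketch of raid5 journalling; 'journal' could be a
    # mirrored pair or an ECC NVRAM card.  None of these names are real
    # md interfaces.

    class JournalledStripeWrite:
        def __init__(self, journal, raid5, stripe):
            self.journal, self.raid5, self.stripe = journal, raid5, stripe

        def submit(self, data_blocks, parity_block):
            # 1. Journal each data block as it becomes available.
            for block in data_blocks:
                self.journal.append(self.stripe, block)

            # 2. Journal the parity block just before touching the raid5,
            #    then make sure everything journalled is stable.
            self.journal.append(self.stripe, parity_block)
            self.journal.flush()

            # 3. Only now write to the raid5.  A crash from here on is
            #    repaired by replaying the journal, so inconsistent parity
            #    never has to be trusted.
            self.raid5.write(self.stripe, data_blocks, parity_block)
            self.journal.discard(self.stripe)   # stripe is consistent again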

The other is to use a filesystem that allows the problem to be avoided
by making sure that the only blocks that can be corrupted are dead
blocks.
This could be done with a copy-on-write filesystem that knows about the
raid5 geometry, and only ever writes to a stripe when no other blocks
on the stripe contain live data.
I've been working on a filesystem which does just this, and hope to
have it available in a year or two (it is a background 'hobby'
project). 
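As a purely illustrative toy (Python, invented names; not the
filesystem being described), the allocation rule looks like this:

    # Toy allocator for a copy-on-write filesystem that knows the raid5
    # geometry: new data only ever goes to stripes holding no live blocks,
    # so an interrupted stripe write can only ever damage dead data.

    class StripeAllocator:
        def __init__(self, num_stripes):
            # live[s] = set of block indices in stripe s holding live data
            self.live = {s: set() for s in range(num_stripes)}

        def allocate_stripe(self):
            """Pick a stripe containing no live data for the next write."""
            for stripe, live_blocks in self.live.items():
                if not live_blocks:
                    return stripe
            raise RuntimeError("no empty stripe; garbage collection needed")

        def commit(self, stripe, blocks_written, superseded):
            """Mark the new copies live and the old copies dead."""
            self.live[stripe].update(blocks_written)
            for old_stripe, old_block in superseded:
                self.live[old_stripe].discard(old_block)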

I know that ZFS is a copy-on-write filesystem.  It is entirely
possible that it can do the right thing for raid5.

And as an addendum, md/raid5 never reports a block as complete to the
filesystem until the underlying drives have reported the data block and
the parity block as being safe.  I.e., it has a write-through cache,
not a write-behind cache.
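Roughly, the completion ordering is (just a sketch of the ordering, not
the actual md code; 'submit_io' and the futures it returns are
assumptions for the sketch):

    # Sketch of write-through behaviour: the filesystem is only told the
    # write completed after both the data and the parity are on the drives.

    def raid5_stripe_write(data_dev, parity_dev, data_block, parity_block,
                           submit_io, report_complete):
        data_done = submit_io(data_dev, data_block)
        parity_done = submit_io(parity_dev, parity_block)

        data_done.wait()       # both must be safe on the drives...
        parity_done.wait()
        report_complete()      # ...before completion is reported upwards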

NeilBrown