On 8/15/06, Edward Shishkin <[EMAIL PROTECTED]> wrote:
checksumming is _not_ much easier than ecc-ing from an implementation standpoint; however, it would be nice if some portion of errors could get fixed without massive surgery performed by fsck
We need checksumming even with ECC... ECC over large spans of data is too computationally costly to apply unless we already know something is wrong (via a checksum). Let's pause for a minute: when you talk about ECC, what are you actually talking about? A Hamming code (used on RAM, http://en.wikipedia.org/wiki/Hamming_code), a convolutional code (used on telecom links, http://en.wikipedia.org/wiki/Convolutional_code), or an erasure code like Reed-Solomon (http://en.wikipedia.org/wiki/Reed-Solomon_code)? I assume in these discussions that you're not talking about an RS-like code... because RAID-5 and RAID-6 are, fundamentally, a form of RS coding. They don't fix bit errors, but when you know you've lost a block of data they can recover it.

Non-RS forms of ECC are very slow in software (especially decoding) and really aren't that useful: most of the time HDDs lose data in nice big chunks, which erasure codes handle well but other codes do not. The catch with erasure codes is that you must know that a block is bad... most of the time the drive will tell you, but sometimes corruption leaks through. This is where block-level checksums come into play: they let you detect bad blocks, and your erasure code then lets you recover the data. The checksum must be fast because you perform it on every read from disk... this makes ECC unsuitable: although it could detect errors, it is too slow, and the number of additional errors it could fix is very small. It would simply be better to store more erasure-code blocks.

An optimal RS code that allows one block of N to fail (and requires one extra block of storage) is computationally trivial; we call it RAID-5. If your 'threat model' is bad sectors rather than bad disks (an increasingly realistic shift), then N needs to have nothing to do with the number of disks you have and can instead be tied to how much protection you want on a file. If 1:N isn't enough for you, RS can be generalized to any number of redundant blocks. Unfortunately, doing so requires modular arithmetic, which current CPUs are not impressively fast at. However, the Linux RAID-6 code demonstrates that two-part parity can be done quite quickly in software.

As such, I think 'ecc' is useless... checksums are useful because they are cheap, and they allow us to use cheap erasure coding (which could sit in a lower-level RAID driver, or be implemented in the FS) to achieve data integrity. (There is a toy sketch of this detect-then-recover cycle below.)

The question of putting error coding in the FS versus in a lower level is, as far as I'm concerned, so clear a matter that it is hardly worth discussing anymore. In my view it is absolutely idiotic to place redundancy in a lower level. The advantage of placing redundancy in a lower level is code simplicity and sharing. The problems with doing so, however, are manifold. The redundancy requirements of the various parts of a file system differ dramatically, and without tight FS integration, matching the need to the service is nearly impossible. The most important reason, however, is performance. RAID-5 (and RAID-6) suffer a tremendous performance hit because of the requirement to either write a full stripe or execute a read-modify-write cycle. With FS-integrated erasure codes it is possible to adjust the layout of the written blocks so that every write is a full-stripe write: effectively you adjust the stripe width with every write so that the write always spans all the disks. Alternatively, you can reduce the number of stripe chunks (i.e. the number of disks) in the parity computation to make the write fit, although doing so wastes space. (Both options are sketched below.)
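To make the detect-then-recover cycle concrete, here is a minimal toy sketch (my own illustration, not code from any existing FS or the Linux MD driver): a cheap per-block checksum recorded at write time tells you which block went bad on read, and the trivial one-redundant-block RS code, i.e. RAID-5 style XOR parity, reconstructs it. The block size, stripe width, and the fletcher32() helper are all made-up choices for the example.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDATA      4            /* data blocks per stripe; parity is one more */

/* Cheap per-block checksum: must be fast, it runs on every read. */
static uint32_t fletcher32(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* Parity = XOR of all data blocks: the trivial 1-of-N erasure code. */
static void make_parity(uint8_t data[NDATA][BLOCK_SIZE], uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

/* Rebuild one known-bad block by XORing parity with the surviving blocks. */
static void rebuild(uint8_t data[NDATA][BLOCK_SIZE], const uint8_t parity[BLOCK_SIZE], int bad)
{
    memcpy(data[bad], parity, BLOCK_SIZE);
    for (int d = 0; d < NDATA; d++)
        if (d != bad)
            for (int i = 0; i < BLOCK_SIZE; i++)
                data[bad][i] ^= data[d][i];
}

int main(void)
{
    uint8_t data[NDATA][BLOCK_SIZE], parity[BLOCK_SIZE];
    uint32_t sum[NDATA];

    /* Fill the blocks with something and record checksums at write time. */
    for (int d = 0; d < NDATA; d++) {
        memset(data[d], 'A' + d, BLOCK_SIZE);
        sum[d] = fletcher32(data[d], BLOCK_SIZE);
    }
    make_parity(data, parity);

    data[2][100] ^= 0xff;       /* silent corruption the drive never reports */

    /* On read: the checksum tells us *which* block is bad ... */
    for (int d = 0; d < NDATA; d++) {
        if (fletcher32(data[d], BLOCK_SIZE) != sum[d]) {
            printf("block %d failed checksum, rebuilding from parity\n", d);
            rebuild(data, parity, d);   /* ... parity tells us *what* it was */
        }
    }

    for (int d = 0; d < NDATA; d++)
        printf("block %d checksum %s\n", d,
               fletcher32(data[d], BLOCK_SIZE) == sum[d] ? "ok" : "BAD");
    return 0;
}

The split is the whole point: the checksum runs on every read so it has to be cheap, while the reconstruction path only runs when a checksum actually fails.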
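And a toy sketch of the two stripe-geometry options above (again my own arithmetic, not any real allocator): either stretch the per-disk chunk so a given write spans all data disks by itself, or shrink the number of chunks in the parity computation so the write fills a smaller stripe. All the numbers are illustrative.

#include <stdio.h>

struct stripe_geom {
    int ndisks_used;     /* data disks participating in the parity computation */
    int chunk_blocks;    /* blocks written to each participating disk */
};

/* Option 1: keep all disks in the stripe and stretch/shrink the chunk size so
 * this write by itself is a full stripe (no read-modify-write needed). */
static struct stripe_geom full_width(int write_blocks, int data_disks)
{
    struct stripe_geom g;
    g.ndisks_used  = data_disks;
    g.chunk_blocks = (write_blocks + data_disks - 1) / data_disks; /* round up */
    return g;
}

/* Option 2: keep a fixed chunk size and shrink the number of disks in the
 * parity computation until the write fills the stripe (wastes the rounding
 * space, but still avoids read-modify-write). */
static struct stripe_geom narrow_stripe(int write_blocks, int data_disks, int chunk_blocks)
{
    struct stripe_geom g;
    g.chunk_blocks = chunk_blocks;
    g.ndisks_used  = (write_blocks + chunk_blocks - 1) / chunk_blocks;
    if (g.ndisks_used > data_disks)
        g.ndisks_used = data_disks;      /* large writes still span everything */
    return g;
}

int main(void)
{
    /* e.g. a 10-block write on a 4+1 array with a nominal 4-block chunk */
    struct stripe_geom a = full_width(10, 4);
    struct stripe_geom b = narrow_stripe(10, 4, 4);
    printf("variable width : %d disks x %d blocks (+1 parity chunk)\n",
           a.ndisks_used, a.chunk_blocks);
    printf("narrow stripe  : %d disks x %d blocks (+1 parity chunk)\n",
           b.ndisks_used, b.chunk_blocks);
    return 0;
}

Either way the parity for the write is computed entirely from data already in hand, so there is never a read-modify-write; the second option just pays for that in wasted space, as noted above.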
FS/redundancy integration also solves the layout problem. In my experience, most systems with hardware RAID get far below optimal performance because even when the FS is smart enough to do file allocation in a RAID-aware way (XFS, and to a lesser extent ext2/3), this is usually foiled by the partition table at the beginning of the RAID device, with the result that 1 in N FS blocks actually spans two disks! (Reading such a block thus incurs potentially 2x the disk latency.)

Separated FS and redundancy layers are an antiquated concept. The FS's job is to provide reliable storage, full stop. It's shocking to see that a dinosaur like SUN has figured this out while the free software community still fights against it.
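To put a number on the partition-table problem, here is a toy calculation (mine, with made-up but typical sizes: 4 KiB FS blocks, a 64 KiB RAID chunk, and the old 63-sector DOS partition offset):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t fs_block   = 4096;          /* FS block size                */
    const uint64_t chunk      = 64 * 1024;     /* RAID chunk (per-disk)        */
    const uint64_t part_start = 63 * 512;      /* classic DOS partition offset */
    const uint64_t nblocks    = 1u << 20;      /* sample a million blocks      */

    uint64_t straddling = 0;
    for (uint64_t i = 0; i < nblocks; i++) {
        uint64_t dev_off = part_start + i * fs_block;   /* offset on the array */
        /* the block spans two chunks (=> two disks) if it crosses a boundary */
        if (dev_off / chunk != (dev_off + fs_block - 1) / chunk)
            straddling++;
    }
    printf("%llu of %llu blocks (1 in %llu) span two disks\n",
           (unsigned long long)straddling, (unsigned long long)nblocks,
           (unsigned long long)(nblocks / straddling));
    return 0;
}

With that geometry, 1 in 16 blocks straddles a chunk boundary; start the partition on a chunk-aligned offset (or drop the partition table entirely) and the count falls to zero, which is exactly the kind of layout knowledge an FS-integrated design gets for free.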
