Suppose you're in the not terribly unusual position of having a bunch
of big files (megabytes at least) on a bunch of unreliable disks.
Every once in a while, one of the disks fails, or user error deletes a
file.  But you would prefer not to lose the data in any of these
files, because it's valuable.  What do you do?

A straightforward and reliable solution is to keep multiple copies of
each file, say two or three, and a message-digest of each file (such
as MD5 or SHA-1) so you can detect silent corruption.  When you notice
a disk has failed or a file is missing, you make another copy of one
of the remaining copies.
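
Here is a minimal sketch of that repair loop in Python.  The function
names, the use of SHA-1, and the idea of passing in the expected
digest are my assumptions for illustration, not a description of any
particular tool.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it a chunk at a time."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repair_copies(copies, expected_digest):
    """Rewrite missing or corrupt copies from a copy that still verifies.

    Returns the list of paths that had to be rewritten.
    """
    copies = [Path(p) for p in copies]
    good = [p for p in copies
            if p.exists() and file_digest(p) == expected_digest]
    if not good:
        raise RuntimeError("no intact copy left")
    repaired = []
    for p in copies:
        if p not in good:
            shutil.copyfile(good[0], p)  # re-mirror from a verified copy
            repaired.append(p)
    return repaired
```

The digest is what lets you tell a silently corrupted copy from a good
one; without it you would only notice copies that are missing outright.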

This scheme is appealingly simple, but the copies are expensive, which
creates a temptation to skimp on the mirroring in some applications.

ECC
---

An error-correcting code transforms some number of bits N into a
larger number of bits M, and can do the reverse transformation
(usually very quickly) even if you flip or erase some of the M bits.
Typically you can erase up to any M-N bits or flip up to any
ceil((M-N-1)/2) bits.  In many codes, the first N of the M bits are
identical to the original N bits: you just append M-N bits to the
original bits.  (Codes with this property are called systematic codes.)
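
To make the arithmetic concrete, here is a tiny Python sketch that
evaluates those two limits.  It assumes a code that actually meets the
"typical" bounds above (Reed-Solomon codes do, counting symbols rather
than bits); the function name is made up.

```python
import math


def correction_capacity(m, n):
    """Return (max erasures, max flips) for a code with m - n check symbols."""
    check = m - n
    # ceil((M-N-1)/2) is the same number as floor((M-N)/2).
    return check, math.ceil((check - 1) / 2)


print(correction_capacity(28, 24))  # (4, 2)
print(correction_capacity(9, 8))    # (1, 0)
```

The second line is a single parity symbol: it can fill in one erasure
but cannot correct a flip whose position is unknown.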

The simplest such code is the parity code: M=N+1, and you XOR your N
bits together to get your Mth bit.  This can detect single-bit errors,
and if you know which bit is lost, it can correct them, too.
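
A small Python sketch of the parity code, with bits held in a list.
The function names are mine, and an erased bit is represented by
`None` at a known position.

```python
from functools import reduce
from operator import xor


def add_parity(bits):
    """Append one parity bit so that the XOR of the whole word is 0."""
    return list(bits) + [reduce(xor, bits, 0)]


def has_error(codeword):
    """True if an odd number of bits have been flipped."""
    return reduce(xor, codeword, 0) != 0


def recover_erased(codeword, position):
    """Fill in one erased bit, at a known position, from the other bits."""
    others = codeword[:position] + codeword[position + 1:]
    fixed = list(codeword)
    fixed[position] = reduce(xor, others, 0)
    return fixed


word = add_parity([1, 0, 1, 1])
print(word)                                     # [1, 0, 1, 1, 1]
print(has_error([1, 0, 0, 1, 1]))               # True
print(recover_erased([1, 0, None, 1, 1], 2))    # [1, 0, 1, 1, 1]
```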

So if you have N files of about the same size, you can read through
them in parallel and generate M-N "ECC files" containing those extra
error-correction bits.  (For concreteness, we could talk about
RS(28,24), a Reed-Solomon code that takes 24 8-bit symbols at a time
and adds 4 more to get a total of 28 8-bit symbols.)  When you get
past the ends of some of the files, you can pretend they have zero
bytes there.
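
As a sketch of the "read through them in parallel" step, here is
Python that produces a single ECC file using the parity code rather
than Reed-Solomon (so M-N is 1); a real RS(28,24) encoder would need a
proper Reed-Solomon implementation.  The function names and chunk size
are assumptions.

```python
from itertools import zip_longest


def read_chunks(path, chunk_size):
    """Yield successive chunks of a file until it is exhausted."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk


def write_parity_file(input_paths, parity_path, chunk_size=1 << 16):
    """XOR the input files together, treating short files as zero-padded."""
    streams = [read_chunks(p, chunk_size) for p in input_paths]
    with open(parity_path, "wb") as out:
        # zip_longest keeps going until the longest file ends; files that
        # have already ended contribute empty chunks, i.e. zero bytes.
        for chunks in zip_longest(*streams, fillvalue=b""):
            width = max(len(c) for c in chunks)
            acc = 0
            for c in chunks:
                acc ^= int.from_bytes(c.ljust(width, b"\0"), "big")
            out.write(acc.to_bytes(width, "big"))
```

The parity file ends up as long as the longest input file.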

These ECC files contain enough information to replace any one of the
existing files, should it be lost or corrupted.  Depending on which
code you use, you may be able to replace several lost files, as long
as the others aren't lost.  (With RS(28,24), you could lose up to four
of the existing files, as long as you knew they were lost.)
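
Recovery with the parity code looks like the sketch below, which works
on in-memory byte strings for brevity.  It can rebuild only one lost
file, and it assumes you know the lost file's original length so the
zero padding can be trimmed; both function names are made up.

```python
def xor_all(blobs):
    """XOR byte strings together, zero-padding the shorter ones."""
    width = max((len(b) for b in blobs), default=0)
    acc = 0
    for b in blobs:
        acc ^= int.from_bytes(b.ljust(width, b"\0"), "big")
    return acc.to_bytes(width, "big")


def recover_lost(survivors, parity, lost_length):
    """Rebuild the one missing file from the survivors plus the parity data."""
    return xor_all(list(survivors) + [parity])[:lost_length]


files = [b"alpha", b"be", b"gamma-ray"]
parity = xor_all(files)
print(recover_lost([files[0], files[2]], parity, 2))  # b'be'
```

XORing the survivors into the parity data cancels them out and leaves
the missing file, which is why the position of the loss must be known.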

So, generate your ECC files from a group of files that are on separate
disks (ideally on separate machines, in separate racks, in separate
hosting centers, served by separate networks, on separate continents)
and store them in other files on yet other disks (preferably as
separate as possible, as well).  Then you can recover any of the files
if they get lost, at relatively low overhead cost.

Finding the original files
--------------------------

There's a practical difficulty if you have a bunch of ECC files full
of apparently random binary data, though, which is that they're only
useful for data reconstruction if you can find the original files they
were generated from (and the other ECC files as well).  So I suggest
beginning the file with some kind of boilerplate header, saying "This
is an ECC file generated by ECC-file-mangler version 2.4, available at
http://www.example.com/eccfile; see the end of the file for more
details", and appending a list of metadata about the input files to
the end of the file, so that you can figure out which of your zillions
of not-lost files you need to read to reconstruct your lost files.
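
One possible layout, sketched in Python.  Everything about the format
here is hypothetical: in particular the final `trailer_bytes` line,
which records the trailer's length so a reader can find it by working
back from the end of the file, is my own addition.

```python
HEADER = (b"This is an ECC file generated by ECC-file-mangler version 2.4, "
          b"available at http://www.example.com/eccfile; "
          b"see the end of the file for more details\n")
FOOTER_FORMAT = b"trailer_bytes: %010d\n"
FOOTER_LEN = len(FOOTER_FORMAT % 0)


def pack_ecc_file(ecc_bytes, input_records):
    """Wrap raw ECC data in a readable header and a metadata trailer.

    input_records is a list of dicts, one per input file.
    """
    trailer = b"\n".join(
        "".join("%s: %s\n" % (k, v) for k, v in sorted(r.items())).encode("utf-8")
        for r in input_records)
    return HEADER + ecc_bytes + trailer + FOOTER_FORMAT % len(trailer)


def unpack_ecc_file(blob):
    """Return (ecc_bytes, trailer_text) from a blob made by pack_ecc_file."""
    footer = blob[-FOOTER_LEN:].decode("ascii")
    trailer_len = int(footer.split(":")[1])
    trailer_start = len(blob) - FOOTER_LEN - trailer_len
    trailer = blob[trailer_start:-FOOTER_LEN].decode("utf-8")
    return blob[len(HEADER):trailer_start], trailer
```

Keeping the header and trailer as plain text means that someone who
stumbles on the file with nothing but a pager can still tell what it is.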

Under the title "filesystem metadata indexing, yet again", at
http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000383.html
I published a program that will run through your filesystem to compute
various metadata about all your files.  The result looks something
like this:

atime: 1074220014
bytes: 1624
ctime: 1072920796
dev_inode: 769_1460288
fips_180_1_sha: a2e62e2d1bfd0e654eeae0b7976478c418dfcaa4
md5: 5da8553a4f8f3f136853c44e8322529c
mode: 0100664
mtime: 1072920796
name: ../Toshiba%2520Pin%2520Password%2520Reset.htm
uid_gid: 500_500

This contains, among other things, the file's original name, the
number of bytes in it, and its MD5 and SHA-1 digests.  Metadata like
this appended to the end of the ECC file would enable you to figure
out which among your thousands of big files the ECC file pertains to,
and perhaps what to call the reconstructed result.
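
A sketch of how such records might be used to match files, in Python.
It assumes records are separated by blank lines (the sample above
shows only one record, so that is a guess) and that `fips_180_1_sha`
is a hex SHA-1 digest; the function names are mine.

```python
import hashlib
import os


def parse_records(text):
    """Turn 'key: value' lines into one dict per blank-line-separated record."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, _, value = line.partition(": ")
            record[key] = value
        records.append(record)
    return records


def matches(path, record, chunk_size=1 << 20):
    """True if the file at path has the size and SHA-1 digest in record."""
    if os.path.getsize(path) != int(record["bytes"]):
        return False  # cheap check first; no need to hash the file
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == record["fips_180_1_sha"]
```

Matching on size and digest rather than on name means a file that has
been moved or renamed can still be found.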
