Suppose you're in the not terribly unusual position of having a bunch of big files (megabytes at least) on a bunch of unreliable disks. Every once in a while, one of the disks fails, or user error deletes a file. But you would prefer not to lose the data in any of these files, because it's valuable. What do you do?
A straightforward and reliable solution is to keep multiple copies of each file, say two or three, along with a message digest of each file (such as MD5 or SHA-1) so you can detect silent corruption. When you notice that a disk has failed or a file is missing, you make another copy from one of the remaining copies. This scheme is appealingly simple, but the copies are expensive, which creates a temptation to skimp on the mirroring in some applications.

ECC
---

An error-correcting code transforms some number of bits N into a larger number of bits M, and can do the reverse transformation (usually very quickly) even if you flip or erase some of the M bits. Typically you can erase up to any M-N bits, or flip up to any ceil((M-N-1)/2) bits. In many codes, the first N of the M bits are identical to the original N bits --- you just append M-N check bits to the original bits. (Codes with this property are called systematic codes.) The simplest such code is the parity code: M = N+1, and you XOR your N bits together to get your Mth bit. This can detect single-bit errors, and if you know which bit was lost, it can correct it, too.

So if you have N files of about the same size, you can read through them in parallel and generate M-N "ECC files" containing those extra error-correction bits. (For concreteness, we could talk about RS(28,24), a Reed-Solomon code that takes 24 8-bit symbols at a time and adds 4 more to get a total of 28 8-bit symbols.) When you get past the ends of the shorter files, you can pretend they contain zero bytes there. These ECC files contain enough information to replace any one of the original files, should it be lost or corrupted. Depending on which code you use, you may be able to replace several lost files, as long as the others survive. (With RS(28,24), you could lose up to four of the 28 files, as long as you knew which ones were lost.)
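The simplest case described above, the parity code with M = N+1, is easy to sketch. Here's a minimal Python illustration (the function name and chunk size are my own, not from any particular tool): it XORs some files together byte by byte, zero-padding past the end of the shorter files, to produce a single parity "ECC file". The same function also performs reconstruction, since XORing the parity file with the surviving files regenerates the missing one (zero-padded to the length of the longest input, which is why you want the original lengths recorded somewhere, as discussed below).

```python
def xor_parity(paths, out_path):
    """Write a parity 'ECC file': the byte-wise XOR of the input
    files, treating bytes past each file's end as zero.  Calling it
    again on the parity file plus the survivors reconstructs a lost
    file (with zero padding up to the longest input's length)."""
    streams = [open(p, "rb") for p in paths]
    try:
        with open(out_path, "wb") as out:
            while True:
                chunks = [f.read(4096) for f in streams]
                size = max(len(c) for c in chunks)
                if size == 0:          # all inputs exhausted
                    break
                block = bytearray(size)  # implicitly zero-padded
                for c in chunks:
                    for i, b in enumerate(c):
                        block[i] ^= b
                out.write(block)
    finally:
        for f in streams:
            f.close()
```

A real Reed-Solomon code like RS(28,24) works on the same read-in-parallel, pad-with-zeros pattern, but computes four check symbols per 24 data symbols instead of one parity byte per N bytes.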
So: generate your ECC files from a group of files that are on separate disks (ideally on separate machines, in separate racks, in separate hosting centers, served by separate networks, on separate continents) and store them on yet other disks (also as separate as possible). Then you can recover any of the files if they get lost, at relatively low overhead cost.

Finding the original files
--------------------------

There's a practical difficulty, though, if you have a bunch of ECC files full of apparently random binary data: they're only useful for data reconstruction if you can find the original files they were generated from (and the other ECC files as well). So I suggest beginning each ECC file with some kind of boilerplate header, saying "This is an ECC file generated by ECC-file-mangler version 2.4, available at http://www.example.com/eccfile; see the end of the file for more details", and appending a list of metadata about the input files to the end of the file, so that you can figure out which of your zillions of not-lost files you need to read to reconstruct your lost files.

Under the title "filesystem metadata indexing, yet again", at http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000383.html I published a program that will run through your filesystem and compute various metadata about all your files. The result looks something like this:

    atime: 1074220014
    bytes: 1624
    ctime: 1072920796
    dev_inode: 769_1460288
    fips_180_1_sha: a2e62e2d1bfd0e654eeae0b7976478c418dfcaa4
    md5: 5da8553a4f8f3f136853c44e8322529c
    mode: 0100664
    mtime: 1072920796
    name: ../Toshiba%2520Pin%2520Password%2520Reset.htm
    uid_gid: 500_500

This contains, among other things, the file's original name and the number of bytes in it. Metadata like this, appended to the end of the ECC file, would let you figure out which among your thousands of big files the ECC file pertains to, and perhaps what to call the reconstructed result.
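For illustration, here's a small Python sketch that collects roughly the metadata record shown above for one file. The field names are taken from the sample record; how the original kragen-hacks program computes them internally is my assumption, not a description of that program.

```python
import hashlib
import os
import urllib.parse

def file_metadata(path):
    """Collect per-file metadata roughly matching the sample record
    above.  Field names come from that record; the details (percent-
    encoded name, chunked hashing) are assumptions."""
    st = os.lstat(path)
    sha1, md5 = hashlib.sha1(), hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return {
        "atime": int(st.st_atime),
        "bytes": st.st_size,
        "ctime": int(st.st_ctime),
        "dev_inode": "%d_%d" % (st.st_dev, st.st_ino),
        "fips_180_1_sha": sha1.hexdigest(),  # FIPS 180-1 is SHA-1
        "md5": md5.hexdigest(),
        "mode": "0%o" % st.st_mode,
        "mtime": int(st.st_mtime),
        "name": urllib.parse.quote(path, safe="/."),
        "uid_gid": "%d_%d" % (st.st_uid, st.st_gid),
    }
```

Appending one such record per input file to the end of an ECC file gives you enough to locate the surviving inputs by name or digest, and the `bytes` field tells you where to trim the zero padding off a reconstructed file.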

