This thread is full of heated rhetoric. First: Silent Data Corruption. I was looking for Google's report on what they found (scarily high), but could not find it. Here's a report on what CERN found:
http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191

The bottom line: even if the buffer of data that your program passes to write() is correct, the following has to happen:

1. It gets copied into the kernel cache, waiting to go out.
2. It gets copied to the disk controller.
3. It goes over the wire to the disk's internal controller.
4. It goes into the disk's internal memory buffer.
5. It waits for the head to reach the place where it will be written.
6. It goes out as some form of head charge reversals, or some other media-specific recording form.

At any stage of the game, the data can be in error. Any non-error-correcting memory can be bad. Any copy can be wrong. Unless your file system internally does some sort of duplication -- and some forms of RAID do -- the only corruption you can detect on read is corruption severe enough to exceed the drive's own error-correcting capability.

Unless your program is going to read back the data that was just written -- AFTER making sure that everything is flushed out (and no, neither fflush() nor fsync() is enough) and the kernel's cache emptied -- there is no way to validate the data.

There was a huge surprise about 10 (?) years ago, when it was discovered (the hard way, as I understand it) that the kernel's re-ordering of disk I/O blocks -- and in fairness, even the drive itself may reorder writes -- was sufficient to ruin the accuracy of journaled databases. You know, software that went to great trouble to ensure that you got either all of a transaction or none of it, using secondary copies of what was written to catch errors on recovery? Yeah, it turned out that reordering of writes killed the accuracy of the system.

---

From fsync()'s man page:

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not. This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.

---

Historically, for *most* flavors of unix, the "raw" device simply meant that when you asked for block X from the drive, the kernel bypassed its buffer cache and asked the drive for block X directly. Sometimes, on some machines, the kernel would send the drive's output by DMA directly into your buffer, rather than reading it into the buffer cache and then copying it into your memory space. So the raw device was much faster -- but a lot of recently written data might not be visible through it, because it was still sitting in the buffer cache.

For years, sync() was not guaranteed to write all data from the kernel to the drives on all systems -- sometimes the default behavior included not writing some modified inode information.
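To make that concrete: since neither fflush() nor fsync() gets the bits all the way to the platter, the sequence the man page is pointing at on OS X is write(), fsync(), then fcntl(fd, F_FULLFSYNC). Here's a minimal sketch, assuming Darwin headers; the file name is just a placeholder, and the read-back at the end is deliberately naive:

    /* A minimal sketch, assuming Mac OS X / Darwin headers.
     * "example.dat" is just a placeholder file name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char data[] = "one transaction record\n";
        char check[sizeof data];

        int fd = open("example.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, data, sizeof data) != (ssize_t)sizeof data) {
            perror("write");            /* data may be only partly queued */
            return 1;
        }

        /* fsync() pushes the data out of the kernel, but the drive may
         * still be sitting on it in its own write cache. */
        if (fsync(fd) < 0)
            perror("fsync");

        /* F_FULLFSYNC (Darwin-specific) asks the drive itself to flush
         * its cache to permanent storage.  Some drives ignore or reject
         * it, so the return value matters. */
        if (fcntl(fd, F_FULLFSYNC) < 0)
            perror("fcntl(F_FULLFSYNC)");

        /* Naive read-back "verification": this almost certainly comes
         * straight out of the kernel's buffer cache, not off the
         * platter, which is exactly the problem described above. */
        if (lseek(fd, 0, SEEK_SET) == 0 &&
            read(fd, check, sizeof check) == (ssize_t)sizeof check &&
            memcmp(data, check, sizeof data) == 0)
            printf("read-back matches (which proves less than it seems)\n");

        close(fd);
        return 0;
    }

Even then, F_FULLFSYNC only asks the drive to flush what it was handed; it says nothing about whether those bits survived every copy along the way.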
The second is "by block number, but the block numbers that the drive presents to the computer is arbitrary" LBA addressing. You ask for block number zero, and the drive gives you a block. Where on the disk is that block actually located? You'll never know without some drive specific (or vendor specific) code. === * NOTHING * in the kernel should be responsible for the accuracy of data written out. * NOTHING * in the kernel should have any knowledge of file systems or file system formats. Old unix -- version 7, and I think system 3 -- did. They had knowledge of one specific file system format in the kernel itself. Later, things changed. File system code and kernel code got separated. NFS kinda made it a requirement. The kernel's inode mapping became a vnode mapping. But even then, there was a big set of assumptions about what kind of pairs could be on the top or bottom of a vnode mapping. Then, came a system called "Ficus". (This was from my time at UCLA). Ficus made kernel changes to allow arbitrary mappings on the vnodes -- and, in particular, you could have a stack of vnode layers. You could even implement a vnode driver in user space. Imagine a class of grad students where in a two week project, some solo, some teams of two, people were able to implement a compression layer, an encryption layer, etc. Now imagine that the system allows you to arbitrarily stack node after node and make a customized storage solution. Ficus was intended to be a replicated, distributed storage system. They decided to use NFS as the distribution system, and wrote layers to go between the kernel, the file system, and NFS, to handle replication. You could use any file system as the back ends -- when the kernel thought it was talking to a file system, it talked to a replicator. That replicator tracked which file systems had current copies, out of date copies, etc. Didn't care if the file system was local, NFS, or some other layer that wasn't actually storage. Didn't care if the storage was only occasionally connected, as long as when you went to read at least one of the current copies was available. In a two week assignment, they demonstrated that the design was actually flexible enough to be arbitrarily extended. You no longer had to code encryption into a file system. You no longer had to code quotas into a file system. Take a good look at modern file systems -- there might be half a dozen different features. Now lets say you want to make one new low level concept -- instead of allocating in 512 byte blocks, you want to allocate in blocks as small as a few bytes or as large as a meg. You know, Reiser style. Just write one layer for handling allocation, and use all the existing quota/journaling/etc features. Ficus worked. People complained that it had too much overhead. Now we've got systems that are about 1000 times faster; the overhead is small enough to be noise now. But Ficus is dead. We're back to monolithic kernel filesystem drivers and kernel level vnode assumptions. Modern kernels should not care anything about the file systems. There is no reason to write monolithic file systems. Yet ... they do, they are, it continues. ZFS is likely to be the last word in monolithic file systems; that doesn't mean it's the last word in file systems. === Why is accuracy not something that the kernel should care about? It's too low level. The application has to check that the entire data is correct. What if the kernel really tried to check it all? Alright, I'll verify this data that went to the hard drive. 
===

Why is accuracy not something the kernel should care about? It's too low level. The application is what has to check that the entire data is correct. What if the kernel really tried to check it all? "Alright, I'll verify this data that went to the hard drive. Write, read, compare -- good." "Ok, great, I'm the NFS destination driver; I'll report to the NFS source that it was fine." "Good, I'm the NFS source writer; I'll report to the kernel that all's well." "Oh, the kernel just asked me to re-read what I just wrote? Ok, another network request." ... It tends to magnify the system overhead, and it doesn't even solve the problem.

BSD systems now have a form of stackable layers. I don't know the details, just that it's some subset of what the Ficus system could do. In that environment, you might have the kernel recheck data multiple times simply because no layer knows how many more layers are stacked above it.

===

MD5 is vulnerable to attacks. That does not mean it fails to work for normal cases. ** NO ** hash system is safe in that absolute sense: the fact that you turn arbitrary files into short strings means there have to be duplications -- there must exist many different files that produce the same hash string. It just so happens that "how to construct different files with the same hash" is known for MD5 and, as I understand it, for SHA-1. As I recall, when the original SHA (what we now call SHA-0) was published, the NSA came back and said "toss in this extra rotation here, we won't tell you why," and that tweak became SHA-1. Years later, MD5 was broken, SHA-0 and then SHA-1 were shown to be weak, but SHA-2 has held up.
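For the ordinary, non-adversarial case -- "did this file change between when I wrote it and when I read it back?" -- any decent hash is still fine. A minimal sketch using the CommonCrypto digest calls that ship with OS X (SHA-256 here, but the same pattern works with the MD5 or SHA-1 functions):

    /* cc -o sha256sum sha256sum.c   (OS X; CommonCrypto is in libSystem) */
    #include <CommonCrypto/CommonDigest.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }

        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }

        CC_SHA256_CTX ctx;
        CC_SHA256_Init(&ctx);

        unsigned char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            CC_SHA256_Update(&ctx, buf, (CC_LONG)n);
        fclose(fp);

        unsigned char digest[CC_SHA256_DIGEST_LENGTH];
        CC_SHA256_Final(digest, &ctx);

        /* Record this next to the file; recompute and compare later to
         * catch silent corruption.  It does nothing against an attacker
         * who can rewrite both the file and the recorded hash. */
        for (int i = 0; i < CC_SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("  %s\n", argv[1]);
        return 0;
    }

That covers the "normal case" above; defending against someone deliberately constructing collisions is a different problem entirely.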
===

Alright, did I get everything?

_______________________________________________
MacOSX-admin mailing list
[email protected]
http://www.omnigroup.com/mailman/listinfo/macosx-admin