This thread is full of heated rhetoric. First: Silent Data Corruption. I was looking for Google's report on what they found (scarily high), but could not find it. Here's a report on what CERN found:
http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191

The bottom line: even if the buffer of data that your program passes to write() is correct, the following has to happen:

1. It gets copied into the kernel cache, waiting to go out.
2. It gets copied to the disk controller.
3. It goes over the wire to the disk's internal controller.
4. It goes into the disk's internal memory buffer.
5. It waits for the head to reach the place where it will be written.
6. It goes out as some form of head charge reversals, or some other media-specific recording form.

At any stage of the game, the data can be in error. Any non-error-correcting memory can be bad. Any copy can be wrong. Unless your file system internally does some sort of duplication -- and some forms of RAID do -- the only corruption you can detect on read is corruption severe enough to exceed the drive's own error-correcting capability.

Unless your program is going to read back the data that was just written -- AFTER making sure that everything is flushed out (and no, neither fflush() nor fsync() is enough) and the kernel's cache emptied -- there is no way to validate the data.

There was a huge surprise about 10 (?) years ago, when it was discovered (the hard way, as I understand it) that the kernel's re-ordering of disk I/O blocks -- and in fairness, even the drive itself may reorder writes -- was sufficient to ruin the accuracy of journaled databases. You know, software that went to great trouble to ensure that you got either all of a transaction or none of it, using secondary copies of what was written to catch errors on recovery? Yeah, it turned out that reordering of writes killed the accuracy of the system.

---

From fsync()'s man page:

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not. This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.

---

Historically, for *most* flavors of unix, the "raw" device simply meant that when you asked for block X from the drive, the kernel bypassed its buffer cache and asked the drive for block X directly. Sometimes, on some machines, the kernel would send the drive's output by DMA directly into your buffer, rather than reading it into the buffer cache and then copying it into your memory space. So the raw device was much faster -- but a lot of recently written data might not be visible through it, because it was still sitting in the buffer cache.

For years, sync() was not guaranteed to write all data from the kernel to the drives on all systems -- sometimes the default behavior included not writing some modified inode information.
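To make that concrete: since neither fflush() nor fsync() gets the bits all the way to the platter, the sequence the man page is pointing at on OS X is write(), fsync(), then fcntl(fd, F_FULLFSYNC). Here's a minimal sketch, assuming Darwin headers; the file name is just a placeholder, and the read-back at the end is deliberately naive:

    /* A minimal sketch, assuming Mac OS X / Darwin headers.
     * "example.dat" is just a placeholder file name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char data[] = "one transaction record\n";
        char check[sizeof data];

        int fd = open("example.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, data, sizeof data) != (ssize_t)sizeof data) {
            perror("write");            /* data may be only partly queued */
            return 1;
        }

        /* fsync() pushes the data out of the kernel, but the drive may
         * still be sitting on it in its own write cache. */
        if (fsync(fd) < 0)
            perror("fsync");

        /* F_FULLFSYNC (Darwin-specific) asks the drive itself to flush
         * its cache to permanent storage.  Some drives ignore or reject
         * it, so the return value matters. */
        if (fcntl(fd, F_FULLFSYNC) < 0)
            perror("fcntl(F_FULLFSYNC)");

        /* Naive read-back "verification": this almost certainly comes
         * straight out of the kernel's buffer cache, not off the
         * platter, which is exactly the problem described above. */
        if (lseek(fd, 0, SEEK_SET) == 0 &&
            read(fd, check, sizeof check) == (ssize_t)sizeof check &&
            memcmp(data, check, sizeof data) == 0)
            printf("read-back matches (which proves less than it seems)\n");

        close(fd);
        return 0;
    }

Even then, F_FULLFSYNC only asks the drive to flush what it was handed; it says nothing about whether those bits survived every copy along the way.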
The second is "by block number, but the block numbers that the drive presents to the computer is arbitrary" LBA addressing. You ask for block number zero, and the drive gives you a block. Where on the disk is that block actually located? You'll never know without some drive specific (or vendor specific) code. === * NOTHING * in the kernel should be responsible for the accuracy of data written out. * NOTHING * in the kernel should have any knowledge of file systems or file system formats. Old unix -- version 7, and I think system 3 -- did. They had knowledge of one specific file system format in the kernel itself. Later, things changed. File system code and kernel code got separated. NFS kinda made it a requirement. The kernel's inode mapping became a vnode mapping. But even then, there was a big set of assumptions about what kind of pairs could be on the top or bottom of a vnode mapping. Then, came a system called "Ficus". (This was from my time at UCLA). Ficus made kernel changes to allow arbitrary mappings on the vnodes -- and, in particular, you could have a stack of vnode layers. You could even implement a vnode driver in user space. Imagine a class of grad students where in a two week project, some solo, some teams of two, people were able to implement a compression layer, an encryption layer, etc. Now imagine that the system allows you to arbitrarily stack node after node and make a customized storage solution. Ficus was intended to be a replicated, distributed storage system. They decided to use NFS as the distribution system, and wrote layers to go between the kernel, the file system, and NFS, to handle replication. You could use any file system as the back ends -- when the kernel thought it was talking to a file system, it talked to a replicator. That replicator tracked which file systems had current copies, out of date copies, etc. Didn't care if the file system was local, NFS, or some other layer that wasn't actually storage. Didn't care if the storage was only occasionally connected, as long as when you went to read at least one of the current copies was available. In a two week assignment, they demonstrated that the design was actually flexible enough to be arbitrarily extended. You no longer had to code encryption into a file system. You no longer had to code quotas into a file system. Take a good look at modern file systems -- there might be half a dozen different features. Now lets say you want to make one new low level concept -- instead of allocating in 512 byte blocks, you want to allocate in blocks as small as a few bytes or as large as a meg. You know, Reiser style. Just write one layer for handling allocation, and use all the existing quota/journaling/etc features. Ficus worked. People complained that it had too much overhead. Now we've got systems that are about 1000 times faster; the overhead is small enough to be noise now. But Ficus is dead. We're back to monolithic kernel filesystem drivers and kernel level vnode assumptions. Modern kernels should not care anything about the file systems. There is no reason to write monolithic file systems. Yet ... they do, they are, it continues. ZFS is likely to be the last word in monolithic file systems; that doesn't mean it's the last word in file systems. === Why is accuracy not something that the kernel should care about? It's too low level. The application has to check that the entire data is correct. What if the kernel really tried to check it all? Alright, I'll verify this data that went to the hard drive. 
===

Why is accuracy not something the kernel should care about? It's too low level. The application is what has to check that the entire data is correct. What if the kernel really tried to check it all? "Alright, I'll verify this data that went to the hard drive. Write, read, compare -- good." "Ok, great, I'm the NFS destination driver; I'll report to the NFS source that it was fine." "Good, I'm the NFS source writer; I'll report to the kernel that all's well." "Oh, the kernel just asked me to re-read what I just wrote? Ok, another network request." ... It tends to magnify the system overhead, and it doesn't even solve the problem.

BSD systems now have a form of stackable layers. I don't know the details, just that it's some subset of what the Ficus system could do. In that environment, you might have the kernel recheck data multiple times simply because no layer knows how many more layers are stacked above it.

===

MD5 is vulnerable to attacks. That does not mean it fails to work for normal cases. ** NO ** hash system is safe in that absolute sense: the fact that you turn arbitrary files into short strings means there have to be duplications -- there must exist many different files that produce the same hash string. It just so happens that "how to construct different files with the same hash" is known for MD5 and, as I understand it, for SHA-1. As I recall, when the original SHA (what we now call SHA-0) was published, the NSA came back and said "toss in this extra rotation here, we won't tell you why," and that tweak became SHA-1. Years later, MD5 was broken, SHA-0 and then SHA-1 were shown to be weak, but SHA-2 has held up.
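For the ordinary, non-adversarial case -- "did this file change between when I wrote it and when I read it back?" -- any decent hash is still fine. A minimal sketch using the CommonCrypto digest calls that ship with OS X (SHA-256 here, but the same pattern works with the MD5 or SHA-1 functions):

    /* cc -o sha256sum sha256sum.c   (OS X; CommonCrypto is in libSystem) */
    #include <CommonCrypto/CommonDigest.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }

        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }

        CC_SHA256_CTX ctx;
        CC_SHA256_Init(&ctx);

        unsigned char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            CC_SHA256_Update(&ctx, buf, (CC_LONG)n);
        fclose(fp);

        unsigned char digest[CC_SHA256_DIGEST_LENGTH];
        CC_SHA256_Final(digest, &ctx);

        /* Record this next to the file; recompute and compare later to
         * catch silent corruption.  It does nothing against an attacker
         * who can rewrite both the file and the recorded hash. */
        for (int i = 0; i < CC_SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("  %s\n", argv[1]);
        return 0;
    }

That covers the "normal case" above; defending against someone deliberately constructing collisions is a different problem entirely.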
===

Alright, did I get everything?

_______________________________________________
MacOSX-admin mailing list
[email protected]
http://www.omnigroup.com/mailman/listinfo/macosx-admin