On 7/7/14, 3:39 PM, Udo Grabowski (IMK) wrote:
> On 07/07/2014 15:09, Saso Kiselkov wrote:
>> On 7/7/14, 2:41 PM, Neal H. Walfield wrote:
>>> Hi,
>>>
>>> It is possible for a bit-flip to change data that is to be written to
>>> disk after its hash has been computed, but before it has been sent to
>>> disk.  This is primarily a concern for systems without ECC RAM.
>>>
>>> It is possible to correct (some of) these errors by including some
>>> forward error correction bits in the hash (or, perhaps next to it, but
>>> we should include some FEC bits for the hash itself since it too could
>>> be corrupted).  It wouldn't have to be more than a few bits, since we
>>> expect at most a bit flip or two for any given block of data.
>>>
>>> Would there be interest in such an extension?
>>>
>>> It is conceivable that a lot of redundancy could be useful: such a
>>> scheme could correct bad blocks on disk.  This would primarily be
>>> useful on systems with just a single drive (e.g., a laptop) or when
>>> resilvering a mirror vdev where the remaining disk has a bad block
>>> (this, unfortunately, has happened to me).
>>>
>>> Thoughts?
>>>
>>> (If this is the wrong place for such questions, please tell where to
>>> post instead.)
>>>
>>> Thanks!
>>>
>>> :) Neal
>>
>> While it'd be relatively straightforward to modify ZFS to support full
>> FEC in its checksum field, I don't see much need for it, primarily
>> because single-bit or few-bit errors on disk are exceedingly rare (in
>> fact, neither I nor anybody I've talked to has seen one). Corruption
>> from storage media seems to come in two flavors:
>>
>> 1) Read errors when the drive's ECC detected the error but couldn't
>> correct it. This is equivalent to corruption in all the bits of the
>> block, which nothing besides a fully redundant copy can fix (i.e.
>> copies=2 or a mirror).
>>
>> 2) Massive corruption when the bit errors exceed the drive's ECC's
>> _detection_ threshold (Hamming distance greater than 2x the error
>> correction threshold) and so would most likely overwhelm ours as well
>> (256 bits of ECC for 1 megabit, or 0.02% of the data, isn't exactly
>> much - your typical FEC transmission scheme usually reserves around
>> 10-20% of raw channel bandwidth for FEC), or misdirected reads/writes,
>> which also result, with overwhelming probability, in a block that's
>> massively different.
>>
>> So while I can see merit in the idea, I don't see much practical need
>> for it. But perhaps I'm not seeing something, so please do correct me if
>> I'm wrong.
>>
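(Side note: the overhead arithmetic I quoted above is easy to sanity-check;
a trivial Python sketch, just to make the 0.02% figure concrete:)

```python
# Sanity-check the quoted overhead figure: 256 bits of ECC
# protecting a 128 KiB (1 Mibit) ZFS block.
block_bits = 128 * 1024 * 8        # 1,048,576 bits
ecc_bits = 256
overhead_pct = ecc_bits / block_bits * 100
print(f"{overhead_pct:.4f}%")      # roughly the 0.02% mentioned above
```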
> 
> What we are talking about is correcting RAM errors (the kind ECC
> RAM would prevent), not disk errors. ZFS is susceptible to that type
> of error. While you can argue that everyone who runs ZFS should use
> ECC, the real world does not give individuals cheap boards that
> support it (remember, controller, CPU, memory, AND firmware must all
> support ECC to get it working, and usually one or more of these is
> missing or disabled on consumer boards).
> So this discussion makes sense.

Ah, sorry, my bad, I skimmed the original e-mail. Unfortunately, this
proposal makes even less sense.

The amount of time a block spends in memory after its checksum has been
computed is pretty slim (checksum computation is performed in the TXG
syncing phase, just before the block is handed off to the HBA). It's
much more probable that the application working on the data and
supplying the result to ZFS for storage would have its bits flipped
(which this scheme won't detect), or that the block's bits would flip
while it's waiting in the TXG for commit to storage (which this scheme
won't detect either).
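To make the coverage gap concrete: FEC bits stored with the checksum can
only repair flips that happen after the checksum is computed. A minimal
Python sketch of the two cases (toy buffer, SHA-256 standing in for the
ZFS block checksum; not real ZFS code):

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # SHA-256 standing in for the ZFS block checksum
    return hashlib.sha256(block).digest()

block = bytearray(b"application data" * 8)

# Flip BEFORE the checksum is computed (app buffer, or waiting in the
# TXG): the corrupt data is checksummed, so a later verify calls it clean.
block[0] ^= 0x01
early = checksum(bytes(block))
undetected = checksum(bytes(block)) == early   # True: corruption baked in

# Flip AFTER the checksum (the narrow pre-DMA window): the mismatch is
# detectable on read, and FEC bits stored alongside could repair it.
late = checksum(bytes(block))
block[1] ^= 0x01
detected = checksum(bytes(block)) != late      # True: caught by the checksum

print(undetected, detected)
```

Only the second case is within reach of the proposed FEC bits; the first
is invisible to any scheme keyed off the checksum.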

PCI Express and other internal interconnects already protect data in
flight (link-level CRC with retry), so nothing is needed there.

Now as for disk controllers: most of the ones with any significant
amount of write buffering and cache sit in larger multi-disk storage
systems, where you'll also be using multi-disk redundancy (keep in mind
we're talking about an event that's orders of magnitude less likely than
a disk failing, so not running multi-disk redundancy while worrying
about in-memory errors is really doing it backwards). Still, there's
some rather small potential here for corruption without ECC.

Then come the storage (SAS, SATA, FC, etc.) buses, all of which also
include ECC in their transmission paths.

Lastly the disk systems themselves.

So the one potentially valid problem point here is the storage
controller on some rather low-cost systems, plus the small window in
main memory between finishing the checksum computation and the DMA
transfer to the controller.
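On the quantitative side, a back-of-the-envelope estimate is easy to
sketch. Both numbers below are assumptions: published DRAM field studies
report soft-error rates on the order of tens of thousands of FIT per
Mbit, and the 10 ms window is a guess at the checksum-to-DMA exposure:

```python
# Back-of-the-envelope: expected bit flips in a 128 KiB block during
# the window between checksum computation and DMA to the controller.
# FIT = failures per 10^9 device-hours; the rate is an assumed figure
# in the range reported by DRAM field studies, not a measurement.
FIT_PER_MBIT = 50_000
block_mbit   = 128 * 1024 * 8 / 1e6   # ~1.05 Mbit
window_hours = 0.010 / 3600           # assumed 10 ms exposure window

flips_per_write = FIT_PER_MBIT / 1e9 * block_mbit * window_hours
print(f"{flips_per_write:.2e} expected flips per block write")
```

Under those assumptions the per-write probability comes out around
10^-10, i.e. vanishingly small next to the failure modes multi-disk
redundancy already covers.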

Frankly, unless I see more empirical and, most importantly, quantitative
data on how big a problem this is, I'm not convinced it's worth it. Neal
mentioned that he's had a corrupted block happen on a mirror resilver,
but that's not evidence that this scheme would have averted it:

1) What was the cause for the damaged block?
2) How extensive was the damage to the block?

My guess is that the answers to those would match the most typical
scenario (speculating here):

1) On-platter magnetic bit rot.
2) Very large, to the point where the drive's ECC miscorrected the
   error (errors beyond the detection threshold).

Cheers,
-- 
Saso
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
