Tinker, what you are basically describing with Fletcher is roughly how
ZFS works. Fletcher itself, on the other hand, is just a simple
checksumming algorithm. Please read up on the ZFS design to learn more
about it.
Now, what I did to turn RAID1 into RAID1C is simply to divide the data
area of RAID1 into a data area and a chksum area. So the layout is:
<softraid metadata><data area><chksum area>. The algorithm for placing
the chksums of blocks is also simply linear so far. That means: the
1st block of the data area is CRC32ed into the first 8 bytes of the
chksum area, the 2nd block of the data area is CRC32ed into the 2nd 8
bytes of the chksum area, etc. For simplicity, every 32k of data in
the data area maps to 512 bytes (1 sector) of the chksum area.

As you can see, this is really as simple as possible, and if you
create an ffs in the data area and then force-attach the drive as
plain RAID1, you still get the same data drive, just smaller by the
size of the chksum area (ffs-wise!), which means compatibility is
preserved -- this is for the case where you really need to get your
data out of a RAID1C for whatever reason.

This design also supports detecting your silently-remapped-block
issue. Let's have data blocks X and Y, chksummed into blocks CHX and
CHY in the chksum area. Now if X is silently remapped onto Y, then X
(in place of Y) will not match CHY. That's the case where both X and Y
are in the data area. When not, then I assume X is in the data area
and Y may be either in the metadata area or in the chksum area. In the
former case, metadata consistency is protected by an MD5 sum (note: I
have not tested self-healing in this case). In the latter case,
remapping X onto Y in the chksum area basically corrupts the chksums
for a lot of blocks in the data area, which will get detected and
healed from the good block(s) on the good drive.
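The linear placement above can be sketched in a few lines. This is
just my reading of the numbers in the mail: the 32k-to-512-byte ratio
implies 512-byte data blocks and 64 eight-byte slots per chksum
sector; the little-endian storage of the CRC32 inside the 8-byte slot
is purely an assumption for illustration, not how softraid actually
lays it out.

```python
import zlib

SECTOR = 512                        # bytes per sector
SLOT = 8                            # bytes reserved per checksum entry
SLOTS_PER_SECTOR = SECTOR // SLOT   # 64 slots -> 64 * 512 = 32k of data
                                    # covered by one chksum sector

def chksum_offset(block_index):
    """Byte offset of a data block's checksum slot, counted from the
    start of the chksum area (1st block -> first 8 bytes, etc.)."""
    return block_index * SLOT

def chksum_sector(block_index):
    """Which sector of the chksum area holds this block's checksum."""
    return block_index // SLOTS_PER_SECTOR

def verify(block_data, stored_slot):
    """Recompute CRC32 and compare against the stored slot
    (little-endian layout is an assumption here)."""
    return zlib.crc32(block_data) == int.from_bytes(stored_slot[:4], "little")

# Data blocks 0..63 (the first 32k) all land in chksum sector 0;
# block 64 starts chksum sector 1.
print(chksum_offset(1))    # 2nd block -> 2nd 8-byte slot, offset 8
print(chksum_sector(63))   # 0
print(chksum_sector(64))   # 1
```

This also shows why a remapped block is caught: the check always uses
the slot derived from the block's *position*, so data that arrives at
the wrong position fails against that position's stored CRC.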
You also ask about I/O overhead. For a read you need to do: read data
+ read chksum -- so 1 I/O becomes 2 I/Os. A write is more difficult:
in general you need to read the chksum, write the data, and write the
new chksum, so 1 I/O becomes 3 I/Os. This can be optimized down to 2
I/Os for a 32k-aligned data write, where the result is exactly aligned
chksum block(s), so you don't need to read the chksum first but can
just write it straight out. That's also the reason why it's so
important, performance-wise, to use a 32k-block fs on RAID1C.

As I wrote, I also tried to get rid of the chksum read (for the
general write case) by using a chksum block cache, but so far without
success -- read: it's still buggy and corrupts data. Well, I'm still
just a softraid beginner, and the problem is that I don't really know
what the upper layer (fs) and perhaps also the lower layer (scsi) do;
I just try to fill the middle (sr) with my code. Oh well, a man needs
to learn, right. :-)
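The I/O counts above can be written down as a toy model. This is only
a sketch of the rule described in the mail (each logical operation
counted as one I/O, and a write treated as aligned only when it covers
whole 32k regions); it ignores how softraid actually splits requests
and the multi-sector cases.

```python
SECTOR = 512
SLOT = 8
DATA_PER_CHKSUM_SECTOR = (SECTOR // SLOT) * SECTOR   # 32 KiB

def write_ios(offset, length):
    """I/Os needed per mirror for one RAID1C write.

    A write that exactly covers whole 32k regions regenerates entire
    chksum sectors, so no read-modify-write of the chksum area is
    needed: write data + write chksum = 2 I/Os.  Any other write must
    first read the partially touched chksum sector: 3 I/Os.
    """
    aligned = (offset % DATA_PER_CHKSUM_SECTOR == 0 and
               length % DATA_PER_CHKSUM_SECTOR == 0 and
               length > 0)
    return 2 if aligned else 3

print(write_ios(0, 32 * 1024))     # 2: write data, write chksum
print(write_ios(0, 16 * 1024))     # 3: must read the chksum sector first
print(write_ios(512, 32 * 1024))   # 3: misaligned start
```

This is also the model behind the 32k-block-fs advice: with a 32k fs
block size, ordinary writes hit the 2-I/O path.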
Last note: you talk about one RAID partition. Well, then no, neither
RAID1 nor RAID1C is for you, since you need at least 2 RAID partitions
in that case -- please read bioctl(8).



On Tue, Dec 1, 2015 at 9:03 PM, Tinker <[email protected]> wrote:
> Sorry for the spam - this is my last post before your next response.
>
> My best understanding is that within your RAID1C, Fletcher could work as a
> "CRC32 on steroids", because it would not only detect error when reading
> sectors/blocks that are broken because they contain inadvertently moved
> data, but also it would detect error when reading sectors/blocks where the
> write *did not go through*.
>
> In such a case, perhaps a disk mirror, or your self-healing area, could help
> figure out what should actually be on that provenly incorrect sector.
>
> This is awesome as it cements fread() integrity guarantees.
>
> The price it comes at, I guess, is a slight overhead (which is that the
> upper branches in the tree need to be updated), and also perhaps if there's
> a power failure that leaves the hash tree corrupt, correcting it would be
> pretty nasty - but that may be the whole point with it, that you're in a
> place where there always are backups and you just want to maximize the read
> correctness guarantees.
>
> For anything important I'd easily prefer to use that.
>
>
>
> On 2015-12-02 03:40, Tinker wrote:
>>
>> Just to illustrate the case. This is just how I got that it works,
>> please pardon the amateur level on algorithm details here.
>>
>> With the Fletcher checksumming, say that you have the Fletcher
>> checksum in a tree structure of two levels: One at the disk root, one
>> for every 100MB of data on the disk.
>>
>> When you read any given sector on the disk, it will be checked for
>> consistency with those two checksums, and if there's a failure,
>> fread() will fail.
>>
>>
>> Example: I write to sector/block X which is at offset 125MB.
>>
>> That means the root checksum and the 100MB-200MB branch checksums are
>> updated.
>>
>>
>> I now shut down and start my machine again, and now block/sector X
>> changed mapping with some random block/sector Y located at offset
>> 1234MB.
>>
>> Consequently, any fread() both of sector X and of sector Y will fail
>> deterministically, because both the root checksum and the 100-200MB
>> checksum and the 1200-1300MB checksum checks would fail.
>>
>>
>> Reading other parts of the disk would work though.
>>
>>
>> On 2015-12-02 03:31, Tinker wrote:
>>>
>>> Hi Karel,
>>>
>>> Glad to talk to you.
>>>
>>> Why the extra IO expense?
>>>
>>>
>>> About the Fletcher vs not Fletcher thing, can you please explain to me
>>> what happens in a setup where I have one single disk with one single
>>> RAID partition on it using your discipline, and..
>>>
>>>  1) I write a sector/block on some position X
>>>
>>>  2) My disk's allocation table gets messed up so it's moved to another
>>> random position Y
>>>
>>>  3) I read sector/block on position Y
>>>
>>>  4) Also I read sector/block on position X
