On Tue, Dec 1, 2015 at 11:21 PM, Tinker <[email protected]> wrote:

> So your current solution is *NOT* data-safe toward "mis-write":s and other
> write errors that go unnoticed at write time.

Yes, if the write error is silent (the drive does not know about it and
does not signal it to the OS), then it will go unnoticed. If the drive
does signal an error, then this is handled by the general softraid
framework, which off-lines the drive.

> While I agree that the probability that the writes to both disks and to
> their checksum areas would fail are really low, the "hash tree"/"100% hash"
> way of ZFS must be said to be a big enabler because it's an integrity
> preservation/data safety scheme of a completely other, higher level:

:-) I really like ZFS, but you are still somewhat mixing up what can be
done at the fs level and what can be done at the block-device level.

> The "checksum area" for the whole tree could be located right at the end of
> the disk too, meaning that the "backward compatibility" you describe would
> be preserved too.
>
> You are right that Fletcher is just another hash function with the standard
> definition i.e. hash(data) => hashvalue -
>
> ZFS' magic ingredient is a Merkle tree of hashes that's all.
>
>
> The benefit I see with a hash tree is that you have in RAM always stored a
> hash of the whole disk (and the first level hashes in the hash tree).

IMHO not in the case of ZFS, i.e. not for data that has not yet been read from ZFS.

> This means that protection against serious transparent write
> errors/mis-write:s goes from none (although implausible) to really solid.

How? I don't see an engineering way for ZFS to detect mis-writes that
even the drive does not know about at all. Yes, ZFS will detect them at
scrub time (on read), or when you access and read the data. But that's
what RAID1C will do too. So where is the difference? (You know I told
you that you need scrub for this, and that it's also on my todo list.)

> I see that the hash-tree could be implemented in a really simple,
> straightforward way:
>
> What about you'd introduce an "über-hash", and then a fixed size of
> "first-level hashes".
>
> The über-hash is a hash of all the first-level hashes, and the first-level
> hashes respectively are a hash of their corresponding set of bottom level
> checksums.

So let's consider a write. RAID1C does it as: read chksum, write data,
write chksum. If your über-hash or first-level hashes are smaller than a
drive block, then I would probably need to store more than one per block
(to save space), and this would mean that on write I would do:

read über-hash, read first-level hash, read chksum, write data, write
chksum, write first-level hash, write über-hash. Hence 1 IO -> 7 IOs,
while in the case of RAID1C it is just 1 IO -> 3 IOs.
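To make the IO accounting concrete, here is a tiny model (my own
illustration, not anything from the actual code) counting device IOs per
application write, assuming one read-modify-write per hash level that is
not cached in RAM:

```python
def ios_per_write(hash_levels, cached_levels=0):
    # Base cost, the RAID1C scheme: read the chksum block, write the
    # data block, write the chksum block back.
    base = 3
    # Each hash-tree level that is NOT cached costs a read plus a
    # write; a level cached in RAM still costs its write-back.
    uncached = hash_levels - cached_levels
    return base + 2 * uncached + cached_levels

print(ios_per_write(0))                   # RAID1C: 3
print(ios_per_write(2))                   # uber-hash + first-level hash: 7
print(ios_per_write(1, cached_levels=1))  # one fully cached level: 4
```

The last line corresponds to the "cache the hashes but still write them"
case discussed below: caching removes the reads but not the writes.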

> If for performance you need more levels then so be it, in all cases it can
> be contained right at the end of the disk.
>
> The benefit here is that the über-hash and first level always will be kept
> in RAM.

OK, so you cache those, but you still need to write them, to preserve
data integrity, right? Then it is still 1 IO -> 4 IOs.

> This means that as soon as any data or bottom-level checksums go out
> of the disk cache and later on are read from the physical disk, then the
> checking of all that data with the RAM-stored hashes, will give us the
> precious absolute fread() guarantee.

"precious obsolute" -- not at all! Read something about hash
collisions and you will lose your hope. :-)
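To illustrate: Fletcher's running sums are taken mod 65535, so a 16-bit
word of 0x0000 and one of 0xFFFF contribute identically, which gives an
easy collision. A toy sketch (my own simplified Fletcher-32, not ZFS's
implementation):

```python
def fletcher32(words):
    # Simplified Fletcher-32 over 16-bit words (no overflow-deferral
    # optimization); both running sums are reduced mod 65535.
    sum1 = sum2 = 0
    for w in words:
        sum1 = (sum1 + w) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

# 0x0000 == 0xFFFF (mod 65535), so these two different blocks collide:
a = [0x0000, 0x1234]
b = [0xFFFF, 0x1234]
print(fletcher32(a) == fletcher32(b))  # True
```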

> (Integrity between reboots will be a slightly more sensitive point. Maybe
> some sysctl could be used to extract the über-hash so you could doublecheck
> it after reboot.)

I've also been thinking about async chksum writes as an option, with
possible chksum merges over several real IOs. The problem is that this
would mean the data on the drive are not well-formed in the case of a
sudden crash or power outage. Yes, it may be an option, but the priority
should IMHO still be consistency (data-wise).

>  * Really just a hashtree-based checksummed passthrough discipline would
> make all sense, e.g. JBOD .. or RAID 0.
>
>    RAID 1 is nice but if you have many nodes and you just want Absolute
> fread() integrity on a single machine, hashtree-checksummed passthrough or
> JBOD or RAID 0 might be a preferable "lean and mean" solution.
>
>    In an environment where you have perfect backups, RAID 1's benefit over
> passthrough is that disk degradation happens slightly more gracefully -
> instead of watching for broken file access and halting immediately then,
> then, as administrator you monitor those sysctl:s you introduce, that tell
> if either underlying disk is broken. I must admit that indeed that's pretty
> neat :)
>
>    ..But still it could always happen that both disks break at the same
> time, so also still the passthrough usecase is really relevant also.

^ this looks like a set of your own wishes?

>  * Do you do any load balancing of read operations to the underlying
> RAID:s, like, round robin?

It's like SR-RAID1! So yes.

>  * About the checksum caching, I'm sure you can find some way to cache
> those so that you need to do less reads of that part of the disk, so the
> problem of lots of reads that you mention in your email will be completely
> resolved - if your code is correct, then the reading overhead from your
> RAID1C should be almost nonexistent.

There is a misunderstanding here. Since I detect silent errors at read
time only (as you pointed out above), I will always read both data and
chksum from the drive, just to make sure the data there are well-formed.
The cache of chksum blocks is there to speed up writes.
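The read path described here can be sketched roughly like this (my own
illustration, with Python lists standing in for the mirror halves and
CRC32 standing in for the real checksum; the actual code lives in the
softraid framework):

```python
import zlib

def read_block(mirrors, idx, checksums):
    """Read block `idx` from a mirror set, verifying against the stored
    checksum and healing any half that returned corrupted data."""
    good = None
    for half in mirrors:
        # Always read both data and chksum; a silent write error is
        # only detectable here, at read time.
        if zlib.crc32(half[idx]) == checksums[idx]:
            good = half[idx]
            break
    if good is None:
        raise IOError("both mirror halves returned corrupted data")
    for half in mirrors:
        if half[idx] != good:
            half[idx] = good  # heal the corrupted half from the good one
    return good

# One block silently corrupted on the second half:
mirrors = [[b"payload"], [b"garbage"]]
checksums = [zlib.crc32(b"payload")]
print(read_block(mirrors, 0, checksums))  # b'payload'
print(mirrors[1][0])                      # healed: b'payload'
```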

Generally speaking: I write code for myself and for others who perhaps
have the same requirements. My requirements are:
- chksumming on a mirror of drives (RAID1), to make sure I'm able to
survive corrupted data on one drive and heal from it; kind of what I'm
using with ZFS/Solaris on my development workstation, but this time for
OpenBSD on my backup server
- as simple as possible code (to preserve the OpenBSD project
philosophy: simplicity)
- avoid hacking in ffs if possible (to respect, and not fight against,
the OpenBSD project's conservativeness)
- provide just a stop-gap solution for now and for, let's say, the
upcoming 5-10 years before hammer2 is ready and also merged into OpenBSD.

From all this, I've concluded that hacking this in at the SR-RAID1 layer
would be the easiest and hopefully an acceptable solution for the OpenBSD
developers, so there is a chance to merge it into the project.

What you are trying to describe and push forward is more fs-level
hacking. If you are that impatient and can't wait for hammer2, then just
take the code and hack on it yourself. It should be fun in the end. For
me, life's too short and my solution is good enough, so I'll certainly
not implement what you suggest, especially since it would be more
complex, hence take a lot more time, and probably miss the target since
hammer2 may be faster in the end.

Have fun! Karel
