Karel, the most important thing is at the bottom of the email :)

On 2015-12-02 19:10, Karel Gardas wrote:
To answer your question: in that case, as soon as that invalid data was actually read from disk, it would be caught by the checksums that are guaranteed to be kept in RAM; that is, the match against the first-level checksums (or the über-checksum) would fail.

Ah, ok, but then this is safe only as long as you do not switch the
machine off. So in a catastrophic scenario like the one you describe,
where all writes fail on all drives, you switch the machine off hoping
everything's all right, and when you switch it on again your read will
probably return old data. Right?

First, that is an extremely high data-safety standard to hold it to.

Second, the total hash could be checked before and after a reboot (see the sketch below).

Third, by far the most likely time for a disk to break is during ordinary operation (the disk makes a mis-write, and you detect that when the block is later read back and its checksum fails to match), and only very secondarily while being powered off. I think problems with flushing data exactly at the moment a disk is powered off would be extremely rare.
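
To make the second point concrete, here is a tiny sketch of what a
before/after-reboot check could look like. This is purely my own
illustration, not RAID1C code: the table size is made up and FNV-1a is
just a stand-in for whatever hash would really be used.

/*
 * Toy sketch: recompute the "uber-checksum" over the first-level
 * checksum table and compare it with the value saved before the
 * previous shutdown.  Hypothetical layout; FNV-1a as a stand-in hash.
 */
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define NCHUNKS 1024		/* hypothetical number of data chunks */

static uint64_t
fnv1a64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len-- > 0) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

int
main(void)
{
	uint64_t chunk_sum[NCHUNKS];	/* first-level checksums */
	uint64_t saved_uber, uber_now;
	size_t i;

	/* Placeholder values standing in for real per-chunk checksums. */
	for (i = 0; i < NCHUNKS; i++)
		chunk_sum[i] = i * 0x9e3779b97f4a7c15ULL;

	/* Total hash recorded just before the previous clean shutdown. */
	saved_uber = fnv1a64(chunk_sum, sizeof(chunk_sum));

	/* Simulate a checksum update that never reached the disk. */
	chunk_sum[42] = 0;

	/* Recompute at the next mount and compare. */
	uber_now = fnv1a64(chunk_sum, sizeof(chunk_sum));
	printf("%s\n", uber_now == saved_uber ?
	    "checksum table intact across reboot" :
	    "mismatch: checksum table changed across reboot");
	return 0;
}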


Btw, any checksum algorithm would work for implementing a tree like this,
even CRC64 I guess, so Fletcher as such is out the window. I intend to
follow up on your other emails in a few hours.

You do not realise how expensive this whole tree-checksumming business
is on fsync. A simple proof of this is a database benchmark; I'm using
pgbench. Try running it and see for yourself, but if you believe me and
trust my numbers, then on an OpenBSD single-client bench setup I get
~1190 tps on RAID1, ~950 tps on RAID1C and, guess what, just ~100 tps on
ZFS on Solaris 11.3. So yes, ZFS is great at caching and optimising
writes this way, but once you insist on fsync, performance suffers.
Side note: I'm really looking forward to seeing how hammer2 is going to
solve that.

Aha, point taken that fsync() would be slow. However, for any I/O that does not involve constant fsync()s, performance should be pretty fine, no? (And what about fsync()s on SSDs? Anyhow, not relevant to my use case.)
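
To put a rough number behind that intuition: the cost difference is
visible even without a database. The toy comparison below is entirely my
own sketch (nothing to do with pgbench or your setup); on an ordinary
spinning disk the fsync()-per-write variant typically runs one to two
orders of magnitude slower than buffered writes with a single fsync() at
the end.

/*
 * Toy micro-benchmark: N small appends with fsync() after every write
 * versus a single fsync() at the end.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NWRITES 1000

static double
run(const char *path, int sync_each)
{
	char buf[512] = { 0 };
	struct timespec t0, t1;
	int fd, i;

	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600)) == -1) {
		perror("open");
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NWRITES; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			exit(1);
		}
		if (sync_each)
			fsync(fd);
	}
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);
	unlink(path);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
	printf("fsync per write: %.3f s\n", run("bench.tmp", 1));
	printf("single fsync:    %.3f s\n", run("bench.tmp", 0));
	return 0;
}

Any checksum-tree maintenance that has to reach the disk on every
fsync() is added on top of that already-dominant cost, which I take to
be your point.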

Also point taken that ZFS does have some overhead.

My point, though, is that I'd be happy to "pay" that, and I believe it can be made much smaller than the roughly 90% overhead ZFS shows in your benchmark above (~100 tps against ~950-1190 tps).

Another side note: even in the current RAID1C I can delay writes (like
ZFS) and optimise by merging checksum computation and writes that way.
I can even read the checksum from a different chunk than the one the
actual data is read from, to mitigate your scenario where all writes are
misdirected on a bad drive (avoiding the all-drives-fail scenario), but
then the result will be more complex code, with the former being much
more complex than the latter, which is actually easier. But based on
what I've seen so far, if I add another layer (or even two) of checksums
and cache them properly, I'm afraid the complexity would go completely
through the roof and it would no longer be considered an OpenBSD-like or
OpenBSD-friendly solution.

Anyway, since you like the scheme, please take the code and hack it
together. If it really is that fantastic and works well, I will be your
first loyal user, believe me.

Delaying writes would be all fine with me.


What causes the code to be complex here?


I would guess that a beautiful way to implement this hashing would be atop your RAID1C! :D

What about a total hash, and then one, or at most two, levels of hashes under it?

You have already implemented caching of checksums and the logic to maintain their reserved area. And since an individual disk's size never changes, that should help keep the complexity under control, I guess.
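
Concretely, I am picturing something along these lines. This is purely
illustrative, with made-up sizes, names and a stand-in FNV-1a hash, and
is not meant to reflect the actual RAID1C code: a per-chunk table much
like what you already keep, a small per-group table above it, and a
single über-checksum over the group table, so a single block write only
refreshes three values.

/*
 * Hypothetical two-level checksum tree: per-chunk checksums (level 1),
 * per-group checksums (level 2) and one uber-checksum over the groups.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNKS_PER_GROUP 256
#define NGROUPS          64
#define NCHUNKS          (CHUNKS_PER_GROUP * NGROUPS)

struct csum_tree {
	uint64_t chunk[NCHUNKS];	/* level 1: per-chunk checksums */
	uint64_t group[NGROUPS];	/* level 2: per-group checksums */
	uint64_t uber;			/* total hash over group[] */
};

static uint64_t
fnv1a64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len-- > 0) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Called after a data chunk has been (re)written. */
static void
csum_update(struct csum_tree *t, size_t chunkno, const void *data, size_t len)
{
	size_t g = chunkno / CHUNKS_PER_GROUP;

	t->chunk[chunkno] = fnv1a64(data, len);
	t->group[g] = fnv1a64(&t->chunk[g * CHUNKS_PER_GROUP],
	    CHUNKS_PER_GROUP * sizeof(t->chunk[0]));
	t->uber = fnv1a64(t->group, sizeof(t->group));
}

int
main(void)
{
	static struct csum_tree t;	/* zero-initialised */
	char block[4096] = "example data";

	csum_update(&t, 300, block, sizeof(block));
	printf("uber-checksum: 0x%016llx\n", (unsigned long long)t.uber);
	return 0;
}

If the group size is chosen so the group table and the über-checksum are
small enough to pin in RAM, the extra per-write cost is re-hashing one
group's worth of chunk checksums plus the group table, a few kilobytes
in total.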


I would guess the whole total-checksum functionality could be done in 1000-2000 lines of code. Does that feel realistic?
