> To answer your question: In that case, as soon as that invalid data is
> actually read from disk, it would be caught by the checksums that are
> guaranteed to be kept in RAM; that is, the first-level checksum (or the
> über-checksum) match would fail.

Ah, OK, but then this is safe only as long as you do not switch the
machine off. So in a catastrophic scenario like the one you describe,
where all writes to all drives fail, you switch the machine off hoping
everything is all right, and when you switch it on again your reads
will probably return old data. Right?
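
To make sure we are talking about the same thing, here is a toy model of
how I understand it. It is only a sketch under my own assumptions: the
Volume class and its fields are hypothetical names, CRC32 stands in for
whatever checksum is used, and a "lost" write is one that updates the
checksums in RAM but never reaches the disk.

```python
import zlib


class Volume:
    """Toy volume: on-disk blocks, on-disk checksums, and a RAM copy."""

    def __init__(self, blocks):
        self.disk = list(blocks)                             # data really on disk
        self.disk_sums = [zlib.crc32(b) for b in self.disk]  # persisted checksums
        self.ram_sums = list(self.disk_sums)                 # copy kept in RAM

    def write(self, idx, data, lost=False):
        """Update the RAM checksum; a lost write never reaches the disk."""
        self.ram_sums[idx] = zlib.crc32(data)
        if not lost:
            self.disk[idx] = data
            self.disk_sums[idx] = zlib.crc32(data)

    def read_ok(self, idx):
        """True if the block read from disk matches the checksum in RAM."""
        return zlib.crc32(self.disk[idx]) == self.ram_sums[idx]

    def reboot(self):
        """Power cycle: the RAM copy is gone and is reloaded from disk."""
        self.ram_sums = list(self.disk_sums)


vol = Volume([b"old0", b"old1"])
vol.write(1, b"new1", lost=True)
print(vol.read_ok(1))   # mismatch is caught while the RAM copy survives
vol.reboot()
print(vol.read_ok(1), vol.disk[1])  # old data now verifies cleanly
```

While the machine stays up, the read of block 1 fails verification. After
the reboot the same read passes, and what it returns is the old data.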

> Btw, any checksum algorithm would work for implementing a tree like this,
> even CRC64 I guess. So Fletcher as such is out the window. I intend
> to follow up on your other emails in some hours.

You do not realise how expensive this whole tree checksumming business
is on fsync. A single proof of this is a database benchmark. I'm using
pgbench. Try running it and see for yourself, but if you trust my
numbers: with a 1-client bench setup on OpenBSD I'm able to get
~1190 tps on RAID1, ~950 tps on RAID1C and, guess what, just ~100 tps
on ZFS on Solaris 11.3. So yes, ZFS is great at caching writes and
optimising them that way, but once you insist on fsync, performance
turns bad. Side note: I'm really looking forward to seeing how hammer2
is going to solve that.
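
If you do not have pgbench at hand, a much cruder way to see the fsync
effect on any filesystem is something like the sketch below. This is not
what pgbench does and it will not reproduce my numbers; the function name
and the 8 KiB block size are my own assumptions, chosen only to show how
much a sync after every write costs compared with buffered writes.

```python
import os
import tempfile
import time


def writes_per_second(n_writes, use_fsync, block=b"x" * 8192):
    """Append n_writes blocks to a temp file, optionally fsync after each."""
    with tempfile.NamedTemporaryFile() as f:
        fd = f.fileno()
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, block)
            if use_fsync:
                os.fsync(fd)  # force this write down to stable storage
        elapsed = time.perf_counter() - start
    return n_writes / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    print("buffered:", round(writes_per_second(200, use_fsync=False)), "writes/s")
    print("fsync   :", round(writes_per_second(200, use_fsync=True)), "writes/s")
```

Run it in a directory on the volume you want to compare. The absolute
figures depend entirely on the hardware and filesystem underneath.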

Another side note: even in the current RAID1C I can delay writes (like
ZFS) and optimize and merge checksum computation and writes that way.
I can even read the checksum from a different chunk than the one the
actual data is read from, to mitigate your scenario where a bad drive
mis-directs all writes (the all-drives-fail scenario excepted). But
the result will be more complex code, with the former being much more
complex than the latter, which is actually the easier one. And based
on what I've seen so far, if I add another layer or even two of them
for further checksums and cache them properly, I'm afraid the
complexity would go through the roof here and the result would no
longer be considered an OpenBSD-like or OpenBSD-friendly solution.
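
The "read the checksum from a different chunk" idea could look roughly
like the sketch below. It is an illustration only, not the RAID1C code:
Chunk, mirror_write and mirror_read are hypothetical names, CRC32 is a
stand-in checksum, and I assume every chunk carries its own copy of both
the data and the per-block checksums.

```python
import zlib


class Chunk:
    """One mirror member: data blocks plus a per-block checksum area."""

    def __init__(self, nblocks):
        self.data = [b""] * nblocks
        self.sums = [zlib.crc32(b"")] * nblocks


def mirror_write(chunks, blk, payload):
    """Write the data and its checksum to every chunk of the mirror."""
    for c in chunks:
        c.data[blk] = payload
        c.sums[blk] = zlib.crc32(payload)


def mirror_read(chunks, blk, start=0):
    """Read data from one chunk, verify it against the NEXT chunk's checksum."""
    n = len(chunks)
    for i in range(n):
        d = chunks[(start + i) % n]
        s = chunks[(start + i + 1) % n]
        if zlib.crc32(d.data[blk]) == s.sums[blk]:
            return d.data[blk]
    raise OSError("no data/checksum pair from different chunks agrees")


chunks = [Chunk(4) for _ in range(3)]
mirror_write(chunks, 1, b"old")
# Simulate chunk 0 silently dropping the next write (data and checksum).
for c in chunks[1:]:
    c.data[1] = b"new"
    c.sums[1] = zlib.crc32(b"new")
print(mirror_read(chunks, 1))  # stale data on chunk 0 is rejected
```

If data and checksum were both taken from chunk 0, the stale block would
verify against its equally stale checksum. Note that with only two chunks
this cross-check can detect the disagreement but cannot tell which side
is right, so the read fails instead of returning data.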

Anyway, as you like the scheme, please take the code and hack it
together. If it is that fantastic and works well, I will be your
first loyal user, believe me.
