> To answer your question: In that case, as soon as that invalid data would
> actually be read from disk, it would be caught by the checksums that are
> guaranteed to be kept in RAM, so that is, the first-level checksums (or the
> über-checksum) match would fail.

Ah, OK, but then this is safe only as long as you do not switch the machine off. So in a catastrophic scenario like the one you describe, where all writes fail on all drives, you switch the machine off hoping everything is all right, and when you switch it on again your read will probably return old data. Right?

> Btw, any checksum algorithm would work for implementing a tree like this by
> the way, even CRC64 I guess. So Fletcher as such is out the window. I intend
> to followup on your other emails in some hours.

You do not realise how expensive this whole tree checksumming business is on fsync. The simple proof is a database benchmark. I'm using pgbench. Try running it and see for yourself, but if you trust my numbers: with a 1-client bench setup on OpenBSD I get ~1190 tps on RAID1 and ~950 tps on RAID1C, and guess what, just ~100 tps on ZFS on Solaris 11.3. So yes, ZFS is great at caching writes and optimising them that way, but once you insist on fsync, performance gets bad.

Side note: I'm really looking forward to seeing how hammer2 is going to solve that.

Another side note: even in the current RAID1C I can delay writes (like ZFS) and optimise and merge chksum computation and writes that way. I can even read the chksum from a different chunk than the one the actual data is read from, to mitigate your scenario of all writes being misdirected on a bad drive (avoiding the all-drives-fail scenario). But the result would be more complex code, with the former being much more complex than the latter, which is actually the easier one. And based on what I've seen so far, adding another layer or even two of further chksums and caching them properly would, I'm afraid, send the complexity through the roof, and the result would no longer be considered an OpenBSD-like or OpenBSD-friendly solution.

Anyway, as you like the scheme, please take the code and hack it together. If it is that fantastic and works well, I will be your first loyal user, believe me.
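
P.S. To make the power-off question above concrete, here is a toy model, not code from RAID1C or ZFS. It assumes the only up-to-date copy of the first-level chksums lives in RAM and that after a reboot they are reloaded from disk. All names (`Disk`, `Volume`, `power_cycle`, `drop_writes`) are hypothetical, and CRC32 is used only because it ships with Python. Whether a real implementation behaves like this depends on where the authoritative chksum is persisted.

```python
import zlib


class ChecksumError(Exception):
    """Raised when data read from disk does not match the chksum held in RAM."""


def csum(data):
    # Any checksum algorithm would do for this sketch.
    return zlib.crc32(data)


class Disk:
    """Toy disk: data blocks plus one on-disk chksum per block."""

    def __init__(self, nblocks):
        self.data = [b"old"] * nblocks
        self.sums = [csum(b"old")] * nblocks
        self.drop_writes = False  # simulate writes that silently go nowhere

    def write(self, blk, payload):
        if self.drop_writes:
            return  # the drive reports success but nothing is stored
        self.data[blk] = payload
        self.sums[blk] = csum(payload)


class Volume:
    """Keeps a copy of the first-level chksums in RAM (an assumption of this sketch)."""

    def __init__(self, disk):
        self.disk = disk
        self.ram_sums = list(disk.sums)

    def write(self, blk, payload):
        self.disk.write(blk, payload)
        self.ram_sums[blk] = csum(payload)  # RAM always knows the expected chksum

    def read(self, blk):
        data = self.disk.data[blk]
        if csum(data) != self.ram_sums[blk]:
            raise ChecksumError("block %d does not match chksum in RAM" % blk)
        return data

    def power_cycle(self):
        # RAM contents are lost; expected chksums are reloaded from disk.
        self.ram_sums = list(self.disk.sums)


def demo():
    vol = Volume(Disk(4))
    vol.disk.drop_writes = True
    vol.write(0, b"new")  # never reaches the disk

    try:
        vol.read(0)
        detected = False
    except ChecksumError:
        detected = True  # caught while the RAM copy is still alive

    vol.power_cycle()
    after_reboot = vol.read(0)  # old data and old chksum agree with each other
    return detected, after_reboot
```

Under these assumptions `demo()` reports the lost write before the reboot, and after the reboot the read succeeds with the old contents, because the stale data and the stale on-disk chksum are consistent with each other.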

