On 08/07/2016 05:44 PM, Howard Chu wrote:
> The only way to guarantee integrity is with ordered writes. All SCSI
> devices support this feature, but e.g. the Linux kernel does not (and
> neither does SATA, and no idea about PCIe SSDs...).
>
> Lacking a portable mechanism for ordered writes, you have two choices
> for preserving integrity - append-only operation (which forces ordered
> writes anyway) or at least one synchronous write somewhere.
>
> Whenever you decide to reuse existing pages rather than operating as
> append-only, you create the possibility of overwriting some required
> data before it was safe to do so. Your 3-root checksum scheme *might*
> let you detect that the DB is corrupted, but it *won't* let you recover
> to a clean state. Given that writes occur in unpredictable order,
> without fsyncs there is no way you can guarantee that anything sane is
> on the disk.

Consider three roots without any checksums. Each root carries a simple flag indicating whether it was written durably, i.e. with an fsync acting as a write barrier. During recovery, non-durable roots are simply ignored and discarded. This is equivalent to Hallvard's suggestion for volatile meta-pages. I think it's pretty clear this is workable. From there, checksums just give you slightly stronger guarantees, although they might not be worth the CPU/storage overhead and the added recovery complexity.
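
To make the three-root idea concrete, here is a rough in-memory sketch in Python. It is only an illustration, not anyone's actual on-disk format: the `Root` fields (`txn_id`, `tree_page`, `durable`) and the helpers `recover`, `pick_slot` and `commit` are names I made up, and the slot-selection policy (never overwrite the newest durable root) is my assumption about how you would keep a recoverable root around. It does not model torn writes or page reuse.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Root:
    txn_id: int     # transaction that produced this root (hypothetical field)
    tree_page: int  # page number of the tree root (hypothetical field)
    durable: bool   # True only if this commit was followed by an fsync


def recover(roots: List[Optional[Root]]) -> Optional[Root]:
    """Ignore non-durable roots and return the newest durable one, if any."""
    durable = [r for r in roots if r is not None and r.durable]
    return max(durable, key=lambda r: r.txn_id, default=None)


def pick_slot(roots: List[Optional[Root]]) -> int:
    """Pick the slot to overwrite: an empty slot first, otherwise the oldest
    root, but never the newest durable root (assumed policy)."""
    keep = recover(roots)
    candidates = [i for i, r in enumerate(roots) if r is None or r is not keep]
    return min(candidates, key=lambda i: -1 if roots[i] is None else roots[i].txn_id)


def commit(roots: List[Optional[Root]], txn_id: int, tree_page: int,
           durable: bool) -> int:
    """Record a new root in the chosen slot and return the slot index."""
    slot = pick_slot(roots)
    roots[slot] = Root(txn_id, tree_page, durable)
    return slot
```

With three slots, any number of non-durable commits can cycle through the other two slots while the last durable root stays put, so recovery always has something to fall back to once the first durable commit exists.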
