On May 18, 2009, at 1:59 PM, Damien Katz wrote:
In branches/tail_header on the svn repository is a working version of the new pure tail-append code for CouchDB.
Right now in trunk we have zero-overwrite storage, which means we never overwrite any previously committed data, metadata, or even index structures.
The exception to the rule is the file header. In all previous versions, CouchDB stores the database header at the head of the file, but it's written twice, one copy after another, with each 2k header copy integrity-checked. If the power fails in the middle of writing one copy, the other should still be available.
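Just to make that concrete, here's a minimal sketch of the old two-copy scheme in Python (the real code is Erlang and frames the header differently; the length-plus-MD5 layout and the function names below are only illustrative):

    import hashlib, struct

    HEADER_SLOT = 2048   # each of the two header copies gets its own 2k slot

    def write_dual_header(f, header_bytes):
        # Frame the header with a length and an MD5 digest so a torn write is detectable.
        record = (struct.pack(">I", len(header_bytes))
                  + hashlib.md5(header_bytes).digest() + header_bytes)
        assert len(record) <= HEADER_SLOT
        record = record.ljust(HEADER_SLOT, b"\0")
        f.seek(0)
        f.write(record)              # copy 1
        f.seek(HEADER_SLOT)
        f.write(record)              # copy 2

    def read_dual_header(f):
        # Return the first copy whose checksum still matches.
        for offset in (0, HEADER_SLOT):
            f.seek(offset)
            slot = f.read(HEADER_SLOT)
            if len(slot) < 20:
                continue
            (size,) = struct.unpack(">I", slot[:4])
            body = slot[20:20 + size]
            if len(body) == size and hashlib.md5(body).digest() == slot[4:20]:
                return body
        raise IOError("both header copies failed their integrity check")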
With zero-overwrite storage, all btree updates happen at the end of the file, but document summaries were written into an internal buffer, so that docs are written contiguously in buffers (usually) near the end of the file. Big file attachments were also written into internally linked buffers near the end of the file. This design has proven very robust and offers reasonable update performance and good document retrieval times. Its weakness, like that of all single-file storage formats, is that if the file gets corrupted, the database may be unrecoverable.
One form of corruption that seems fairly common (we've seen it at least 3 times) is file truncation, which is to say the end of the file goes missing. This seems to happen sometimes after a file system fills up, or a machine suffers a power loss.
Unfortunately, when file truncation happens with CouchDB, it's not just the last blocks of data that are lost, it's the whole file, because the last thing it writes is the root btree node that's necessary to find the remaining indexes. It's possible to write a tool to scan back and attempt to find the correct offset pointers to restore the file, but that would be pretty expensive and wouldn't always be correct.
To fix this, the tail_header branch I created uses something like zero-overwrite storage, but takes it a little further and uses append-only storage, with every single update or deletion causing a write at the very end of the file, making the file grow. Even the header is stored at the end of the file (it would be more accurate to call it a trailer, I guess).
With this design, any file truncation simply results in an earlier version of the database. If a commit is interrupted before the header gets completely written, then the next time the database is opened, the commit data is skipped over as CouchDB scans backward looking for a valid header.
Every 4k, a single special byte has a value of either 1 or 0. A value of 1 means a header immediately follows the byte; otherwise it's a regular storage block. Every regular write to the file that spans one of these special bytes is split up and the special byte inserted. When reading from the file, the special bytes are automatically stripped out of the data.
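Here's a rough sketch of that block scheme in Python, just to make the splitting and stripping concrete (the on-disk details in the branch differ; this is illustrative only):

    BLOCK = 4096   # a special marker byte sits at every 4k boundary

    def append(f, data):
        # Append data at EOF, inserting a 0 marker byte at each 4k boundary crossed.
        f.seek(0, 2)                      # go to end of file
        start = pos = f.tell()
        out = bytearray()
        for b in data:
            if pos % BLOCK == 0:
                out.append(0)             # 0 = regular storage block
                pos += 1
            out.append(b)
            pos += 1
        f.write(bytes(out))
        return start                      # physical offset the caller keeps as a pointer

    def read(f, offset, length):
        # Read `length` logical bytes starting at `offset`, stripping the marker bytes.
        f.seek(offset)
        result = bytearray()
        pos = offset
        while len(result) < length:
            if pos % BLOCK == 0:
                f.read(1)                 # skip the marker byte
                pos += 1
            run = min(BLOCK - pos % BLOCK, length - len(result))
            chunk = f.read(run)
            if not chunk:
                break                     # hit EOF (truncated file)
            result.extend(chunk)
            pos += len(chunk)
        return bytes(result)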
When a file is first opened, the header is searched for by scanning back through the blocks, looking for a valid header that passes all the integrity checks. Usually this will be very fast, but it could be a long scan depending on how much data was written but not committed before the failure.
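And a sketch of the scan-back itself, continuing the toy format above (again assuming the simple length-plus-MD5 header framing from the earlier sketch, and that a header fits inside one block):

    import hashlib, struct

    BLOCK = 4096   # as in the sketch above

    def find_last_header(f):
        # Walk backward from EOF one 4k block at a time, returning the newest header
        # whose marker byte is 1 and whose integrity check passes.
        f.seek(0, 2)
        block = (f.tell() - 1) // BLOCK          # index of the last block in the file
        while block >= 0:
            f.seek(block * BLOCK)
            if f.read(1) == b"\x01":             # 1 = a header immediately follows
                prefix = f.read(20)              # toy framing: 4-byte length + 16-byte MD5
                if len(prefix) == 20:
                    (size,) = struct.unpack(">I", prefix[:4])
                    body = f.read(size)
                    if len(body) == size and hashlib.md5(body).digest() == prefix[4:20]:
                        return block * BLOCK, body
            block -= 1                           # torn or missing header: keep scanning back
        raise IOError("no valid header found")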
Besides being very robust in the face of truncation, this format has the advantage of potentially speeding up commits greatly, as everything is written sequentially at the end of the file, allowing tons of data to be written out without ever having to do a head seek. And fsync can now be called fewer times. If you have an application where you don't mind losing your most recent updates, you could turn off fsync altogether. However, this assumes ordered sequential writes, i.e. that the FS will never write out later bytes before earlier bytes.
Large file attachments have more overhead, as the files are broken up into ~4k chunks and a pointer is stored to each chunk. This means opening a document also requires loading the pointers to each chunk, instead of a single pointer like before.
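The extra bookkeeping looks roughly like this (a Python sketch; in the branch the chunks go through the tail-append writer rather than raw writes, and the pointer encoding is different):

    CHUNK = 4096

    def write_attachment(f, data):
        # Append an attachment in ~4k chunks, returning the list of (offset, length)
        # pointers that would be stored alongside the document.
        pointers = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            f.seek(0, 2)                   # append at end of file
            pointers.append((f.tell(), len(piece)))
            f.write(piece)
        return pointers

    def read_attachment(f, pointers):
        # Reassemble the attachment by following every chunk pointer.
        parts = []
        for offset, length in pointers:
            f.seek(offset)
            parts.append(f.read(length))
        return b"".join(parts)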
Upsides:
- Extremely robust storage format. Data truncations, as caused by OS crashes, incomplete copies, etc., still allow earlier versions of the database to be recovered.
- Faster commit speeds (in theory).
- OS-level backups can simply copy the new bytes over. (Hmmm, but this won't work with compaction or if we automatically truncate to a valid header on file open.)
- View index updates never require an fsync (assuming ordered sequential writes).
Downsides:
- Every update to the database will have up to 4k of overhead for
header writing (the actual header is smaller, but must be written 4k
aligned).
- Individually updated documents are laid out more sparsely on disk by default, making long view builds slower (in theory) as the disk will need to seek forward more often. (But compaction will fix this.)
- On file open, must seek back through the file to find a valid
header.
- More overhead for large file attachments.
Work to be done:
- More options for when to do fsync or not, to optimize for
underlying file system (before header write, after header write, not
at all, etc)
- Rollback? Do we want to support rolling back the file to previous
versions?
- Truncate on open? - When we open a file, do we want to
automatically truncate off any uncommitted garbage that could be
left over?
- Compaction should write attachments in one stage of copying, then the documents themselves; right now attachment and document writes are interleaved per-document.
- Live upgrade of 0.9.0. It would be nice to be able to serve old
style files to allow for zero downtime on upgrade. Right now the
branch doesn't understand old files at all.
- Possibly we need to fsync on database file open, since the file might be in the FS cache but not on disk due to a previous CouchDB crash. This can cause problems if the view indexer (or any indexer, like Lucene) updates its index and it gets committed to disk, but the most recent version of the database still isn't committed. Then if the OS crashes or a power loss occurs, the index files might unknowingly reflect lost state in the database, which would be fixable only by doing a complete view rebuild.
Feedback on all this is welcome. Please try out the branch to shake out any bugs or performance problems that might be lurking.
-Damien
So I think this patch is ready for trunk.
It now serves old files without downtime and I've tested it out manually, but I haven't written any automated tests for it. If you can, please try it out on a trunk database and view(s) and see if everything still works correctly. Also test out compacting the database to fully upgrade it to the current format. Note: please make a backup of the database before doing this; just opening an old file with the new code causes it to partially upgrade so that previous versions don't recognize it.
This live upgrade code is sprinkled throughout the source and the
places are marked. We will remove these, probably after the next
version (0.10).
The new code has ini configurable fsyncs:
[couchdb]
sync_options = [before_header, after_header, on_file_open]
By default, all three options are on; you can turn some or all of them off in local.ini like this:
[couchdb]
sync_options = []
For default transactions, the header is only written out once per second, reducing its size impact, particularly for high-volume writes.
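Roughly the idea, as a toy Python sketch (the names and structure are illustrative, not the actual Erlang implementation): document writes append their data immediately, but the header that commits them is written at most once per second.

    import time

    class DelayedHeaderWriter:
        # Toy model of once-per-second header writes.
        def __init__(self, write_header):
            self.write_header = write_header   # callback that appends a header block
            self.dirty = False
            self.last_flush = 0.0

        def note_update(self):
            # Called after each document write; the data is already appended,
            # it just isn't findable until the next header goes out.
            self.dirty = True
            self.maybe_flush()

        def maybe_flush(self):
            # A real implementation would also flush from a timer so the last
            # update isn't left pending when writes stop.
            now = time.monotonic()
            if self.dirty and now - self.last_flush >= 1.0:
                self.write_header()
                self.dirty = False
                self.last_flush = now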
Also the tail append stuff gives us the ability to have even more
transaction options to optimize for different systems, but that can
all be done later.
After discovering how much CPU it was eating, I turned the term compression stuff completely off. But it should be ini- or database-configurable eventually. As Adam Kocoloski pointed out, regardless of how it's compressed when saved, it's always readable later.
It does not have version rollback or "truncate to valid header on
open", but those are features that can be added later without much
work if necessary.
Feedback please!
-Damien