On Sat, May 23, 2009 at 4:21 AM, Robert Dionne <[email protected]> wrote:

> Fyi, I've been using this branch in my testing lately and everything works fine, except the latest db upgrade changes break hovercraft:test() in the attachment streaming. The call to couch_doc:bin_foldl now has different behavior. The fix was trivial: changing two calls in hovercraft to use the length function rather than size.
>
> I'm happy to push a patch to the hovercraft project when this tail append branch is merged to trunk.
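(A rough sketch of the kind of change Bob describes; att_length/1 is a hypothetical helper, not the actual hovercraft source:)

    %% Illustrative only, not the actual hovercraft source. Attachment
    %% data that used to arrive as one binary now arrives as a list of
    %% chunks, so the two calls change from size/1 to length/1:
    att_length(AttData) when is_binary(AttData) ->
        size(AttData);     % pre-upgrade: byte size of a single binary
    att_length(AttData) when is_list(AttData) ->
        length(AttData).   % tail_header branch: number of chunks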
I'd been planning on writing that, so if you send the patch my way I can apply it now, maybe maintaining it as its own branch as well until the merge to trunk.

Thanks,
Chris

> Cheers,
>
> Bob
>
> On May 22, 2009, at 4:56 PM, Damien Katz wrote:
>
>> On May 18, 2009, at 1:59 PM, Damien Katz wrote:
>>
>>> In branches/tail_header on the svn repository is a working version of new pure tail-append code for CouchDB.
>>>
>>> Right now in trunk we have zero-overwrite storage, which means we never overwrite any previously committed data, metadata, or even index structures.
>>>
>>> The exception to the rule is the file header. In all previous versions, CouchDB stores the database header at the head of the file, but it's written twice, one copy after another, with each 2k header copy integrity-checked. If the power fails in the middle of writing one copy, the other should still be available.
>>>
>>> With zero-overwrite storage, all btree updates happen at the end of the file, but document summaries were written into a buffer internally, so that docs are written contiguously in buffers (usually) near the end of the file. Big file attachments were written into internally linked buffers, also near the end of the file. This design has proven very robust and offers reasonable update performance and good document retrieval times. Its weakness, like that of all single-file storage formats, is that if the file gets corrupted, the database may be unrecoverable.
>>>
>>> One form of corruption that seems fairly common (we've seen it at least 3 times) is file truncation, which is to say the end of the file goes missing. This seems to happen sometimes after a file system fills up, or a machine suffers a power loss.
>>>
>>> Unfortunately, when file truncation happens with CouchDB, it's not just the last blocks of data that are lost, it's the whole file, because the last bits of data it writes are the root btree nodes necessary to find the remaining indexes. It's possible to write a tool to scan back and attempt to find the correct offset pointers to restore the file, but that's pretty expensive and wouldn't always be correct.
>>>
>>> To fix this, the tail_header branch I created uses something like zero-overwrite storage, but takes it a little further and uses append-only storage, with every single update or deletion causing an update to the very end of the file, making the file grow. Even the header is stored at the end of the file (it would be more accurate to call it a trailer, I guess).
>>>
>>> With this design, any file truncation simply results in an earlier version of the database. If a commit is interrupted before the header gets completely written, then the next time the database is opened, the commit data is skipped over as it scans backward looking for a valid header.
>>>
>>> Every 4k, a single byte has either a value of 1 or 0. A value of 1 means a header immediately follows the byte; otherwise it's a regular storage block. Every regular write to the file, if it spans the special byte, is split up and the special byte inserted. When reading from the file, the special bytes are automatically stripped out of the data.
>>>
>>> When a file is first opened, the header is searched for by scanning back through the blocks, looking for a valid header that passes all the integrity checks. Usually this will be very fast, but it could be a long scan, depending on how much data was written but not committed before the failure.
>>>
>>> Besides being very robust in the face of truncation, this format has the advantage of potentially speeding up commits greatly, as everything is written sequentially at the end of the file, allowing tons of data to be written out without ever having to do a head seek. And fsync can be called fewer times now. If you have an application where you don't mind losing your most recent updates, you could turn off fsync altogether. However, this assumes ordered sequential writes, i.e. that the FS will never write out later bytes before earlier bytes.
>>>
>>> Large file attachments have more overhead, as the files are broken up into ~4k chunks and a pointer is stored to each chunk. This means opening a document also requires loading the pointers to each chunk, instead of a single pointer like before.
>>>
>>> Upsides:
>>> - Extremely robust storage format. Data truncations, as caused by OS crashes, incomplete copies, etc., still allow for earlier versions of the database to be recovered.
>>> - Faster commit speeds (in theory).
>>> - OS-level backups can simply copy the new bytes over. (Hmmm, but this won't work with compaction, or if we automatically truncate to the last valid header on file open.)
>>> - View index updates never require an fsync (assuming ordered sequential writes).
>>>
>>> Downsides:
>>> - Every update to the database will have up to 4k of overhead for the header write (the actual header is smaller, but must be written 4k-aligned).
>>> - Individually updated documents are more sparse on disk by default, making long view builds slower (in theory), as the disk will need to seek forward more often. (But compaction will fix this.)
>>> - On file open, we must seek back through the file to find a valid header.
>>> - More overhead for large file attachments.
>>>
>>> Work to be done:
>>> - More options for when to do an fsync or not, to optimize for the underlying file system (before header write, after header write, not at all, etc.).
>>> - Rollback? Do we want to support rolling back the file to previous versions?
>>> - Truncate on open? When we open a file, do we want to automatically truncate off any uncommitted garbage that could be left over?
>>> - Compaction should write attachments in one stage of copying, then the documents themselves; right now attachment and document writes are interleaved per document.
>>> - Live upgrade of 0.9.0. It would be nice to be able to serve old-style files to allow for zero downtime on upgrade. Right now the branch doesn't understand old files at all.
>>> - Possibly we need to fsync on database file open, since the file might be in the FS cache but not on disk due to a previous CouchDB crash. This can cause problems if the view indexer (or any indexer, like Lucene) updates its index and it gets committed to disk, but the most recent version of the database still isn't committed. Then if the OS crashes or power loss occurs, the index files might unknowingly reflect lost state in the database, which would be fixable only by doing a complete view rebuild.
>>>
>>> Feedback on all this welcome. Please try out the branch to shake out any bugs or performance problems that might be lurking.
>>>
>>> -Damien
>>
>> So I think this patch is ready for trunk.
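(A rough Erlang sketch of the 4k block format and backward header scan described in the May 18 mail above; the names make_blocks/2, remove_markers/2, find_header/2, and load_header/2 are illustrative, not the actual couch_file code:)

    -define(BLOCK_SIZE, 4096).

    %% Split Data so that a marker byte begins every 4096-byte block.
    %% BlockOffset is the current position within the current block.
    make_blocks(_BlockOffset, <<>>) ->
        [];
    make_blocks(0, Data) ->
        %% At a block boundary: emit a 0 marker (regular data block).
        %% A header write would emit <<1>> here instead, block-aligned.
        [<<0>> | make_blocks(1, Data)];
    make_blocks(BlockOffset, Data) ->
        Space = ?BLOCK_SIZE - BlockOffset,
        case byte_size(Data) =< Space of
            true ->
                [Data];
            false ->
                <<Chunk:Space/binary, Rest/binary>> = Data,
                [Chunk | make_blocks(0, Rest)]
        end.

    %% Inverse of make_blocks: strip the marker bytes back out of
    %% data read from disk.
    remove_markers(_BlockOffset, <<>>) ->
        <<>>;
    remove_markers(0, <<_Marker, Rest/binary>>) ->
        remove_markers(1, Rest);
    remove_markers(BlockOffset, Data) ->
        Space = ?BLOCK_SIZE - BlockOffset,
        case byte_size(Data) =< Space of
            true ->
                Data;
            false ->
                <<Chunk:Space/binary, Rest/binary>> = Data,
                <<Chunk/binary, (remove_markers(0, Rest))/binary>>
        end.

    %% On file open: scan backward block by block until a block whose
    %% marker byte is 1 yields a header passing its integrity checks.
    %% load_header/2 is assumed to read and checksum-verify a header.
    find_header(_Fd, Block) when Block < 0 ->
        no_valid_header;
    find_header(Fd, Block) ->
        case file:pread(Fd, Block * ?BLOCK_SIZE, 1) of
            {ok, <<1>>} ->
                case load_header(Fd, Block * ?BLOCK_SIZE + 1) of
                    {ok, Header} -> {ok, Header};
                    _Corrupt     -> find_header(Fd, Block - 1)
                end;
            _NoMarkerHere ->
                find_header(Fd, Block - 1)
        end.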
>> It now serves old files without downtime, and I've tested it out manually, but I haven't written any automated tests for it. If you can, please try it out on a trunk database and view(s) and see if everything still works correctly. Also test out compacting the database to fully upgrade it to the current format. Note: please make a backup of the database before doing this; just opening an old file with the new code causes it to partially upgrade, so that previous versions don't recognize it.
>>
>> This live upgrade code is sprinkled throughout the source and the places are marked. We will remove these, probably after the next version (0.10).
>>
>> The new code has ini-configurable fsyncs:
>>
>> [couchdb]
>> sync_options = [before_header, after_header, on_file_open]
>>
>> By default, all three options are on. You can turn some or all off in the local.ini like this:
>>
>> [couchdb]
>> sync_options = []
>>
>> For default transactions, the header is only written out once per second, reducing its size impact, particularly in high-volume writes. Also, the tail append stuff gives us the ability to have even more transaction options to optimize for different systems, but that can all be done later.
>>
>> After discovering how much CPU it was eating, I turned the term compression stuff completely off. But it should be ini- or database-configurable eventually. As Adam Kocoloski pointed out, regardless of how it's compressed when saved, it's always readable later.
>>
>> It does not have version rollback or "truncate to valid header on open", but those are features that can be added later without much work if necessary.
>>
>> Feedback please!
>>
>> -Damien

--
Chris Anderson
http://jchrisa.net
http://couch.io
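(A sketch of how the sync_options list might gate the fsync calls; maybe_sync/3 and write_header/2 are hypothetical names, not the actual implementation:)

    %% Illustrative only, not the actual CouchDB source. Assumes
    %% SyncOptions is the list of atoms parsed from the ini file,
    %% e.g. [before_header, after_header, on_file_open].
    maybe_sync(Event, Fd, SyncOptions) ->
        case lists:member(Event, SyncOptions) of
            true  -> ok = file:sync(Fd);
            false -> ok
        end.

    %% A commit would then bracket the header write like so:
    %%   maybe_sync(before_header, Fd, SyncOptions),
    %%   ok = write_header(Fd, Header),   % write_header/2 hypothetical
    %%   maybe_sync(after_header, Fd, SyncOptions)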
