On May 18, 2009, at 1:59 PM, Damien Katz wrote:
In branches/tail_header on the svn repository is a working version of the new pure tail-append code for CouchDB.
Right now in trunk we have zero-overwrite storage, which means we never overwrite any previously committed data, metadata, or even index structures.
The exception to the rule is the file header. In all previous versions, CouchDB stores the database header at the head of the file, but it's written twice, one copy after another, with each 2k header copy integrity-checked. If the power fails in the middle of writing one copy, the other should still be available.
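Just to make that concrete, here's a minimal sketch of the old two-copy scheme in Python (the real code is Erlang and frames the header differently; the length-plus-MD5 layout and the function names below are only illustrative):

    import hashlib, struct

    HEADER_SLOT = 2048   # each of the two header copies gets its own 2k slot

    def write_dual_header(f, header_bytes):
        # Frame the header with a length and an MD5 digest so a torn write is detectable.
        record = (struct.pack(">I", len(header_bytes))
                  + hashlib.md5(header_bytes).digest() + header_bytes)
        assert len(record) <= HEADER_SLOT
        record = record.ljust(HEADER_SLOT, b"\0")
        f.seek(0)
        f.write(record)              # copy 1
        f.seek(HEADER_SLOT)
        f.write(record)              # copy 2

    def read_dual_header(f):
        # Return the first copy whose checksum still matches.
        for offset in (0, HEADER_SLOT):
            f.seek(offset)
            slot = f.read(HEADER_SLOT)
            if len(slot) < 20:
                continue
            (size,) = struct.unpack(">I", slot[:4])
            body = slot[20:20 + size]
            if len(body) == size and hashlib.md5(body).digest() == slot[4:20]:
                return body
        raise IOError("both header copies failed their integrity check")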
With zero-overwrite storage, all btree updates happen at the end of the file, but document summaries were written into an internal buffer, so that docs are written contiguously in buffers (usually) near the end of the file. Big file attachments were also written into internally linked buffers near the end of the file. This design has proven very robust and offers reasonable update performance and good document retrieval times. Its weakness, like that of all single-file storage formats, is that if the file gets corrupted, the database may be unrecoverable.
One form of corruption that seems fairly common (we've seen it at least 3 times) is file truncation, which is to say the end of the file goes missing. This seems to happen sometimes after a file system fills up, or a machine suffers a power loss.
Unfortunately, when file truncation happens with CouchDB, it's not just the last blocks of data that are lost, it's the whole file, because the last thing it writes is the root btree node that's necessary to find the remaining indexes. It's possible to write a tool to scan back and attempt to find the correct offset pointers to restore the file, but that would be pretty expensive and wouldn't always be correct.
To fix this, the tail_header branch I created uses something like zero-overwrite storage, but takes it a little further and uses append-only storage, with every single update or deletion causing a write at the very end of the file, making the file grow. Even the header is stored at the end of the file (it would be more accurate to call it a trailer, I guess).
With this design, any file truncation simply results in an earlier version of the database. If a commit is interrupted before the header gets completely written, then the next time the database is opened, the commit data is skipped over as CouchDB scans backward looking for a valid header.
Every 4k, a single special byte has a value of either 1 or 0. A value of 1 means a header immediately follows the byte; otherwise it's a regular storage block. Every regular write to the file that spans one of these special bytes is split up and the special byte inserted. When reading from the file, the special bytes are automatically stripped out of the data.
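Here's a rough sketch of that block scheme in Python, just to make the splitting and stripping concrete (the on-disk details in the branch differ; this is illustrative only):

    BLOCK = 4096   # a special marker byte sits at every 4k boundary

    def append(f, data):
        # Append data at EOF, inserting a 0 marker byte at each 4k boundary crossed.
        f.seek(0, 2)                      # go to end of file
        start = pos = f.tell()
        out = bytearray()
        for b in data:
            if pos % BLOCK == 0:
                out.append(0)             # 0 = regular storage block
                pos += 1
            out.append(b)
            pos += 1
        f.write(bytes(out))
        return start                      # physical offset the caller keeps as a pointer

    def read(f, offset, length):
        # Read `length` logical bytes starting at `offset`, stripping the marker bytes.
        f.seek(offset)
        result = bytearray()
        pos = offset
        while len(result) < length:
            if pos % BLOCK == 0:
                f.read(1)                 # skip the marker byte
                pos += 1
            run = min(BLOCK - pos % BLOCK, length - len(result))
            chunk = f.read(run)
            if not chunk:
                break                     # hit EOF (truncated file)
            result.extend(chunk)
            pos += len(chunk)
        return bytes(result)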
When a file is first opened, the header is searched for by scanning back through the blocks, looking for a valid header that passes all the integrity checks. Usually this will be very fast, but it could be a long scan depending on how much data was written but not committed before the failure.
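And a sketch of the scan-back itself, continuing the toy format above (again assuming the simple length-plus-MD5 header framing from the earlier sketch, and that a header fits inside one block):

    import hashlib, struct

    BLOCK = 4096   # as in the sketch above

    def find_last_header(f):
        # Walk backward from EOF one 4k block at a time, returning the newest header
        # whose marker byte is 1 and whose integrity check passes.
        f.seek(0, 2)
        block = (f.tell() - 1) // BLOCK          # index of the last block in the file
        while block >= 0:
            f.seek(block * BLOCK)
            if f.read(1) == b"\x01":             # 1 = a header immediately follows
                prefix = f.read(20)              # toy framing: 4-byte length + 16-byte MD5
                if len(prefix) == 20:
                    (size,) = struct.unpack(">I", prefix[:4])
                    body = f.read(size)
                    if len(body) == size and hashlib.md5(body).digest() == prefix[4:20]:
                        return block * BLOCK, body
            block -= 1                           # torn or missing header: keep scanning back
        raise IOError("no valid header found")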
Besides being very robust in the face of truncation, this format has the advantage of potentially speeding up commits greatly, as everything is written sequentially at the end of the file, allowing tons of data to be written out without ever having to do a head seek. And fsync can now be called fewer times. If you have an application where you don't mind losing your most recent updates, you could turn off fsync altogether. However, this assumes ordered sequential writes, i.e. that the FS will never write out later bytes before earlier bytes.
Large file attachments have more overhead, as the files are broken up into ~4k chunks and a pointer is stored to each chunk. This means opening a document also requires loading the pointers to each chunk, instead of a single pointer like before.
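The extra bookkeeping looks roughly like this (a Python sketch; in the branch the chunks go through the tail-append writer rather than raw writes, and the pointer encoding is different):

    CHUNK = 4096

    def write_attachment(f, data):
        # Append an attachment in ~4k chunks, returning the list of (offset, length)
        # pointers that would be stored alongside the document.
        pointers = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            f.seek(0, 2)                   # append at end of file
            pointers.append((f.tell(), len(piece)))
            f.write(piece)
        return pointers

    def read_attachment(f, pointers):
        # Reassemble the attachment by following every chunk pointer.
        parts = []
        for offset, length in pointers:
            f.seek(offset)
            parts.append(f.read(length))
        return b"".join(parts)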
Upsides:
- Extremely robust storage format. Data truncations, as caused by OS crashes, incomplete copies, etc., still allow earlier versions of the database to be recovered.
- Faster commit speeds (in theory).
- OS-level backups can simply copy the new bytes over. (Hmmm, but this won't work with compaction or if we automatically truncate to a valid header on file open.)
- View index updates never require an fsync (assuming ordered sequential writes).
Downsides:
- Every update to the database will have up to 4k of overhead for
header writing (the actual header is smaller, but must be written 4k
aligned).
- Individually updated documents are laid out more sparsely on disk by default, making long view builds slower (in theory) as the disk will need to seek forward more often. (But compaction will fix this.)
- On file open, must seek back through the file to find a valid
header.
- More overhead for large file attachments.
Work to be done:
- More options for when to do fsync or not, to optimize for
underlying file system (before header write, after header write, not
at all, etc)
- Rollback? Do we want to support rolling back the file to previous
versions?
- Truncate on open? - When we open a file, do we want to
automatically truncate off any uncommitted garbage that could be
left over?
- Compaction should write attachments in one stage of copying, then the documents themselves; right now attachment and document writes are interleaved per-document.
- Live upgrade of 0.9.0. It would be nice to be able to serve old
style files to allow for zero downtime on upgrade. Right now the
branch doesn't understand old files at all.
- Possibly we need to fsync on database file open, since the file might be in the FS cache but not on disk due to a previous CouchDB crash. This can cause problems if the view indexer (or any indexer, like Lucene) updates its index and it gets committed to disk, but the most recent version of the database still isn't committed. Then if the OS crashes or a power loss occurs, the index files might unknowingly reflect lost state in the database, which would be fixable only by doing a complete view rebuild.
Feedback on all this is welcome. Please try out the branch to shake out any bugs or performance problems that might be lurking.
-Damien
So I think this patch is ready for trunk.
It now serves old files without downtime and I've tested it out manually, but I haven't written any automated tests for it. If you can, please try it out on a trunk database and view(s) and see if everything still works correctly. Also test out compacting the database to fully upgrade it to the current format. Note: please make a backup of the database before doing this; just opening an old file with the new code causes it to partially upgrade so that previous versions don't recognize it.
This live upgrade code is sprinkled throughout the source and the
places are marked. We will remove these, probably after the next
version (0.10).
The new code has ini configurable fsyncs:
[couchdb]
sync_options = [before_header, after_header, on_file_open]
By default, all three options are on; you can turn some or all of them off in local.ini like this:
[couchdb]
sync_options = []
For default transactions, the header is only written out once per second, reducing its size impact, particularly for high-volume writes.
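Roughly the idea, as a toy Python sketch (the names and structure are illustrative, not the actual Erlang implementation): document writes append their data immediately, but the header that commits them is written at most once per second.

    import time

    class DelayedHeaderWriter:
        # Toy model of once-per-second header writes.
        def __init__(self, write_header):
            self.write_header = write_header   # callback that appends a header block
            self.dirty = False
            self.last_flush = 0.0

        def note_update(self):
            # Called after each document write; the data is already appended,
            # it just isn't findable until the next header goes out.
            self.dirty = True
            self.maybe_flush()

        def maybe_flush(self):
            # A real implementation would also flush from a timer so the last
            # update isn't left pending when writes stop.
            now = time.monotonic()
            if self.dirty and now - self.last_flush >= 1.0:
                self.write_header()
                self.dirty = False
                self.last_flush = now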
Also the tail append stuff gives us the ability to have even more
transaction options to optimize for different systems, but that can
all be done later.
After discovering how much CPU it was eating, I turned the term compression stuff completely off. But it should be ini- or database-configurable eventually. As Adam Kocoloski pointed out, regardless of how it's compressed when saved, it's always readable later.
It does not have version rollback or "truncate to valid header on
open", but those are features that can be added later without much
work if necessary.
Feedback please!
-Damien