On May 20, 2009, at 4:12 AM, Brian Candler wrote:

On Mon, May 18, 2009 at 01:59:08PM -0400, Damien Katz wrote:
If you have an application where you don't mind losing your most recent updates, you could turn off fsync altogether. However, this assumes ordered-sequential writes, i.e. that the FS will never write out later bytes before earlier bytes.

... and also that the drive doesn't reorder the writes itself.

Correct, that's what I mean by FS, file system. Different setups will have different behaviors, which is why we'll make the flush settings customizable in the .ini, to optimize for the underlying FS (fsync before header write, after header write, not at all, etc.).



You could checksum each 4K data block (if that's not done already), but then you'd need to scan from the front of the file to the end to find the first
invalid block. Perhaps that's the job of a recovery tool, rather than
something to be done every time the database is opened.
Downsides:
- Every update to the database will have up to 4k of overhead for header writing (the actual header is smaller, but must be written 4k aligned).

At 4KB, one (batch of) writes per second equates to ~337MB per day of
overhead. Fairly significant, although perhaps not too bad with daily or
weekly compaction.

The header overhead is more like 2k on average. But I don't mind wasting disk space; it's an extremely cheap resource.
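For reference, the per-day figure quoted above works out as follows (one 4 KiB header write per second, sustained for a day):

```python
# One 4 KiB header write per second, accumulated over 24 hours:
overhead_per_day = 4096 * 60 * 60 * 24   # bytes
print(overhead_per_day / 2**20)          # 337.5 (MiB per day)
```

At Damien's ~2k average the steady-state figure halves, and compaction reclaims it either way.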


1K blocks would be a lot better from that point of view, presumably at the
cost of more work breaking up docs and attachments.

Sure, I picked 4k as a number out of the air; it's something that can be tuned.


For writes of individual small docs, do you always write out a 4KB data block followed by a 4KB header block? If so, a simple optimisation would be
a mixed data+header block:

00 .... data
01 hh hh .... <<hhhh bytes of data followed by header>>

I'd think that it's pointless to write out a new header unless there has
been some data to write as well.

We don't write out a db header unless it's changed.
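Brian's mixed data+header block could be framed like this (a hypothetical sketch: the tag byte values and big-endian 2-byte length field are assumptions drawn from his diagram, not an actual on-disk format):

```python
import struct

# Tag 0x00: the block is all data.
# Tag 0x01: a 2-byte big-endian length gives how many data bytes
#           precede the header in the same block.
def pack_mixed(data, header=None):
    if header is None:
        return b"\x00" + data
    return b"\x01" + struct.pack(">H", len(data)) + data + header

def unpack_mixed(block):
    if block[0] == 0x00:
        return block[1:], None
    (n,) = struct.unpack(">H", block[1:3])
    return block[3:3 + n], block[3 + n:]
```

This saves a full 4K header block whenever a small document and the header change together, at the cost of one tag byte per block.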


- Individually updated documents are more sparse on disk by default,
making long view builds slower (in theory) as the disk will need to seek
forward more often. (but compaction will fix this)

Maybe it won't do head seeks, if the O/S or application does read-ahead, but
you'll certainly be reading through a larger file.

In general, I would happily trade some speed for better crash-resistance.

As for timing of fsync: ideally what I would like is for each write
operation to return some sort of checkpoint tag (which could just be the current file size). Then have a new HTTP operation "wait for sync <tag>" which blocks until the file has been fsync'd at or after that position.

Just use the X-Couch-Full-Commit: true HTTP header.


This would allow useful semantics in proxies. e.g. I don't want to return an acknowledgement to a client until the document has been safely written to at least two locations, but I don't want to force an fsync for those requests.
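The checkpoint-tag idea sketched above could be tracked in-process like this (a sketch with hypothetical names; a real implementation would call `record_fsync` from whatever actually issues the fsync):

```python
import threading

class SyncTracker:
    """Each write returns a checkpoint tag (here, the file offset);
    wait_for_sync blocks until a later fsync has covered that tag."""
    def __init__(self):
        self._cond = threading.Condition()
        self._written = 0
        self._synced = 0

    def record_write(self, nbytes):
        with self._cond:
            self._written += nbytes
            return self._written          # the checkpoint tag

    def record_fsync(self):
        with self._cond:
            self._synced = self._written  # everything written so far is durable
            self._cond.notify_all()

    def wait_for_sync(self, tag, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._synced >= tag, timeout)
```

A "wait for sync <tag>" HTTP call would then park on `wait_for_sync` without forcing an extra fsync of its own, which is exactly the proxy use case described: acknowledge the client only once enough replicas report the tag as synced.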

One other thought. Is there value in extending the file in chunks of, say,
16MB - in the hope that the O/S is more likely to allocate contiguous
regions of storage?

I don't know, probably depends on the FS.

-Damien
