On May 20, 2009, at 4:12 AM, Brian Candler wrote:

On Mon, May 18, 2009 at 01:59:08PM -0400, Damien Katz wrote:
If you have an application where you don't mind losing your most recent updates, you could turn off fsync altogether. However, this assumes ordered-sequential writes, i.e. that the FS will never write out later bytes before earlier bytes.

... and also that the drive doesn't reorder the writes itself.

Correct, that's what I mean by FS, file system. Different setups will have different behaviors, which is why we'll make the flush settings customizable in the .ini, to optimize for the underlying FS (fsync before header write, after header write, not at all, etc.).



You could checksum each 4K data block (if that's not done already), but then you'd need to scan from the front of the file to the end to find the first
invalid block. Perhaps that's the job of a recovery tool, rather than
something to be done every time the database is opened.
Downsides:
- Every update to the database will have up to 4k of overhead for header writing (the actual header is smaller, but must be written 4k aligned).

At 4KB, one (batch of) writes per second equates to ~337MB per day of
overhead. Fairly significant, although perhaps not too bad with daily or
weekly compaction.

The header overhead is more like 2k on average. But I don't mind wasting disk space; it's an extremely cheap resource.
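For reference, the per-day figure quoted above works out as follows (one 4 KiB header write per second, sustained for a day):

```python
# One 4 KiB header write per second, accumulated over 24 hours:
overhead_per_day = 4096 * 60 * 60 * 24   # bytes
print(overhead_per_day / 2**20)          # 337.5 (MiB per day)
```

At Damien's ~2k average the steady-state figure halves, and compaction reclaims it either way.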


1K blocks would be a lot better from that point of view, presumably at the
cost of more work breaking up docs and attachments.

Sure, I picked 4k as a number out of the air; it's something that can be tuned.


For writes of individual small docs, do you always write out a 4KB data block followed by a 4KB header block? If so, a simple optimisation would be
a mixed data+header block:

00 .... data
01 hh hh .... <<hhhh bytes of data followed by header>>

I'd think that it's pointless to write out a new header unless there has
been some data to write as well.

We don't write out a db header unless it's changed.
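Brian's mixed data+header block could be framed like this (a hypothetical sketch: the tag byte values and big-endian 2-byte length field are assumptions drawn from his diagram, not an actual on-disk format):

```python
import struct

# Tag 0x00: the block is all data.
# Tag 0x01: a 2-byte big-endian length gives how many data bytes
#           precede the header in the same block.
def pack_mixed(data, header=None):
    if header is None:
        return b"\x00" + data
    return b"\x01" + struct.pack(">H", len(data)) + data + header

def unpack_mixed(block):
    if block[0] == 0x00:
        return block[1:], None
    (n,) = struct.unpack(">H", block[1:3])
    return block[3:3 + n], block[3 + n:]
```

This saves a full 4K header block whenever a small document and the header change together, at the cost of one tag byte per block.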


- Individually updated documents are more sparse on disk by default,
making long view builds slower (in theory) as the disk will need to seek
forward more often. (but compaction will fix this)

Maybe it won't do head seeks, if the O/S or application does read-ahead, but
you'll certainly be reading through a larger file.

In general, I would happily trade some speed for better crash-resistance.

As for timing of fsync: ideally what I would like is for each write
operation to return some sort of checkpoint tag (which could just be the current file size). Then have a new HTTP operation "wait for sync <tag>" which blocks until the file has been fsync'd at or after that position.

Just use the X-Couch-Full-Commit: true HTTP header.


This would allow useful semantics in proxies. e.g. I don't want to return an acknowledgement to a client until the document has been safely written to at least two locations, but I don't want to force an fsync for those requests.
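The checkpoint-tag idea sketched above could be tracked in-process like this (a sketch with hypothetical names; a real implementation would call `record_fsync` from whatever actually issues the fsync):

```python
import threading

class SyncTracker:
    """Each write returns a checkpoint tag (here, the file offset);
    wait_for_sync blocks until a later fsync has covered that tag."""
    def __init__(self):
        self._cond = threading.Condition()
        self._written = 0
        self._synced = 0

    def record_write(self, nbytes):
        with self._cond:
            self._written += nbytes
            return self._written          # the checkpoint tag

    def record_fsync(self):
        with self._cond:
            self._synced = self._written  # everything written so far is durable
            self._cond.notify_all()

    def wait_for_sync(self, tag, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._synced >= tag, timeout)
```

A "wait for sync <tag>" HTTP call would then park on `wait_for_sync` without forcing an extra fsync of its own, which is exactly the proxy use case described: acknowledge the client only once enough replicas report the tag as synced.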

One other thought. Is there value in extending the file in chunks of, say,
16MB - in the hope that the O/S is more likely to allocate contiguous
regions of storage?

I don't know, probably depends on the FS.

-Damien
