Bron Gondwana wrote:
> Hi All,
> 
> We were debugging the CPU usage in a ctl_conversationsdb rebuild yesterday,
> and noticed an interesting thing.  70% of the CPU utilisation for this one
> process was inside the kernel!  Mostly with dirty pages.
> 
> ctl_conversationsdb -R is particularly heavy on the twoskip database - it's
> rewriting a lot of random keys.  This leads to writes all over the place, as
> it stitches records into the skiplists.
> 
> Of course the "real answer"[tm] is zeroskip, which doesn't do random writes -
> but until then, we suspect that the cost is largely due to the fact that we
> use mmap to read, and fwrite to write!  We know that might be less efficient
> already from Linus' comments about 10 years ago!  And I guess here's the
> proof.
> 
> An option would be to switch to using mmap to write as well.  We could easily 
> modify lib/mappedfile to memcpy to do the writes.
> 
> Does anybody see any strong reason not to?

I've covered the reasons for/against writing thru mmap in my LMDB design
papers. I don't know how relevant all of these are for your use case:

1: writing thru mmap loses any control over write ordering - the OS will page
dirty pages out in arbitrary order. If you're using a filesystem that supports
ordered writes, it will preserve the ordering of data from write() calls.

2: making the mmap writable opens the possibility of undetectable data
structure corruption if any other code is doing stray writes through arbitrary
pointers. You need to be very sure your code is bug-free.

3: if your DB is larger than RAM, writing thru mmap is slower than using
write() syscalls. Whenever you access a page for the first time, the OS will
page it in. This is a wasted I/O if all you're doing is overwriting the page
with new data.

4: you can't use mmap exclusively, if you need to grow the output file. You can
only write thru the mapping to pages that already exist. If you need to grow
the file, you must preallocate the space, otherwise you get a SEGV when
referencing unallocated pages.
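
To make points 1 and 4 concrete, here is a minimal sketch of appending thru a
writable mapping. The helper name append_via_mmap is hypothetical and is not
part of lib/mappedfile or LMDB. It grows the file with ftruncate() before
touching the new range (POSIX describes the fault for touching mapped pages
past end of file as SIGBUS), and it calls msync() so the dirty pages are
written back at a known point instead of whenever the OS chooses.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes at offset old_size by growing the
 * file first, then copying thru a writable shared mapping.
 * Returns 0 on success, -1 on failure. */
int append_via_mmap(int fd, off_t old_size, const void *buf, size_t len)
{
    if (len == 0)
        return 0;               /* nothing to do; mmap rejects length 0 */

    off_t new_size = old_size + (off_t)len;

    /* Grow the file BEFORE touching the new range thru the mapping. */
    if (ftruncate(fd, new_size) != 0)
        return -1;

    /* Map the whole file from offset 0 so no page-alignment math is needed. */
    char *base = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return -1;

    memcpy(base + old_size, buf, len);

    /* MS_SYNC blocks until the dirty pages in this range are written back. */
    int rc = msync(base, (size_t)new_size, MS_SYNC);
    if (munmap(base, (size_t)new_size) != 0)
        rc = -1;
    return rc;
}
```

Remapping on every append is only for brevity; a real implementation would
presumably preallocate in larger chunks and keep the mapping around.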

And a side note, multiple studies have shown that skiplists are not
cache-friendly, and thus have inferior performance to B+tree organizations. A
skiplist is a very poor choice for a read/write data structure.

Obviously I would recommend you use something carefully designed and heavily
tested, like LMDB, instead of whatever you're using.

There's one point in favor of writing thru mmap - if you take care of all the
other potential gotchas, it will work on every OS that implements mmap. Using
mmap for reads, and syscalls for writes, is only valid on OSs with a unified
buffer cache. While this isn't a problem on most modern OSs, OpenBSD is a
notable example of an OS that lacks this, and so that approach always results
in file corruption there.
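
As a rough illustration of the mixed approach (mmap for reads, syscalls for
writes), the sketch below maps a file read-only, overwrites a range with
pwrite(), and then checks whether the new bytes are visible thru the mapping.
The helper name pwrite_visible_in_map is hypothetical. Keeping the mapping
PROT_READ also addresses point 2, since a stray write thru the mapping faults
instead of silently corrupting data. On an OS with a unified buffer cache one
would expect the check to succeed; it is a quick probe, not a proof of
coherence.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if bytes written with pwrite() show up in a
 * read-only shared mapping of the same file, 0 if the mapping looks stale,
 * and -1 on error. file_size must be the current size of the file. */
int pwrite_visible_in_map(int fd, size_t file_size, off_t off,
                          const void *buf, size_t len)
{
    if (len == 0 || off < 0 || (size_t)off + len > file_size)
        return -1;

    /* Read-only mapping: stray stores thru this pointer fault immediately. */
    const char *base = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return -1;

    int result = -1;
    /* The write goes thru the syscall path, not thru the mapping. */
    if (pwrite(fd, buf, len, off) == (ssize_t)len)
        result = (memcmp(base + off, buf, len) == 0) ? 1 : 0;

    munmap((void *)base, file_size);
    return result;
}
```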

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
