The first thing added to the unstable tree now that 0.19 is out is some 
work to improve the object storage and journaling code.  The new approach 
is more flexible in terms of how you can use the available hardware on 
each storage node, and it also removes the current strict btrfs 
requirement.

First, a bit of history.  The original storage daemon used a userspace 
file system (ebofs) that was copy on write with frequent superblock 
commits (that would push all prior operations to disk).  Since a full 
commit could be somewhat slow (multiple disk seeks), a "write-behind" 
journal was added: a write was first applied to the local fs (ebofs), and 
then an entry was written to a journal.  If the journal entry committed to 
disk before ebofs did a full commit, we could ack sooner.  If the journal 
was slow, or it filled up, then we were no worse off, because we could 
still ack when the full commit happened.  The journal can be either a file 
or a raw block device.  A separate disk is good; an SSD or NVRAM device is 
better.  A file on another device works too, although it won't be quite as 
fast.
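
In code terms, the write-behind ack logic boils down to something like 
the sketch below.  The names (WriteBehindOp and friends) are made up for 
illustration and aren't the actual cosd code; the point is just that the 
client gets acked as soon as either the journal entry or a covering full 
commit is safely on disk, whichever comes first.

  // Hypothetical sketch, not the real cosd code: ack the client as soon
  // as either the journal entry or a covering fs commit is on disk.
  #include <functional>
  #include <mutex>

  struct WriteBehindOp {
    std::function<void()> ack;   // notifies the client the write is durable
    bool acked = false;
    std::mutex lock;

    void on_journal_commit() { maybe_ack(); }  // journal entry hit disk
    void on_fs_commit()      { maybe_ack(); }  // full fs commit hit disk

  private:
    void maybe_ack() {
      std::lock_guard<std::mutex> g(lock);
      if (!acked) {              // whichever completion arrives first wins
        acked = true;
        ack();
      }
    }
  };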

Then we dropped ebofs in favor of btrfs: it had more features, better 
testing, and (one hoped) better performance.  The problem was with 
commits: when a btrfs commit starts, new write operations block while the 
fs gets its ducks in a row and flushes everything out to disk.  That 
means cosd writes would block (sometimes for seconds) waiting to be 
applied to the fs, even though the journal device was otherwise idle 
(with typical latencies under 10ms).

So, the new code decouples applying the writes (to the fs) from 
journaling entirely, moving each into its own threadpool.  You can 
operate in writeahead mode (journal, then apply), parallel mode (start 
both immediately), or the old writebehind mode (apply, then journal; not 
recommended).  The trick is that you have to apply a write before you can 
read back the change, and you (usually) have to commit before you can 
ack.  Since the commit and apply can complete in either order, lots of 
code had to be changed to properly reflect the read-what-I-just-wrote 
dependencies, and some additional tracking of in-progress writes was 
needed.
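
Roughly, the mode just decides when each half of the pipeline is allowed 
to start.  Here's a hypothetical sketch (the Op struct and the 
queue_apply()/queue_journal() helpers are made up, not the real cosd 
interfaces), but it captures the ordering:

  // Hypothetical sketch of the three modes; queue_apply()/queue_journal()
  // stand in for handing an op to the separate apply and journal
  // threadpools.
  #include <functional>

  enum class JournalMode { WRITEAHEAD, PARALLEL, WRITEBEHIND };

  struct Op {
    bool applied = false;    // change is visible on the local fs
    bool committed = false;  // journal entry is safely on disk
  };

  void queue_apply(Op& op) { /* apply threadpool */ op.applied = true; }
  void queue_journal(Op& op, std::function<void()> on_commit) {
    /* journal threadpool */
    op.committed = true;
    if (on_commit) on_commit();
  }

  void submit(Op& op, JournalMode mode) {
    switch (mode) {
    case JournalMode::WRITEAHEAD:   // journal, then apply
      queue_journal(op, [&op] { queue_apply(op); });
      break;
    case JournalMode::PARALLEL:     // start both immediately
      queue_journal(op, nullptr);
      queue_apply(op);
      break;
    case JournalMode::WRITEBEHIND:  // apply, then journal (old behaviour)
      queue_apply(op);
      queue_journal(op, nullptr);
      break;
    }
    // Reads of the object wait for op.applied; the ack back to the client
    // waits for op.committed.  The two can complete in either order.
  }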

There are two big wins here:

 * If you're using a journal (ideally on a second device: a raw disk, an 
SSD, or NVRAM), write latency will be consistently lower.  Any 
sluggishness from the fs doing its commits won't have an immediate 
impact.

 * Btrfs isn't strictly required anymore for maintaining consistency.  
Previously, we used some btrfs ioctls to group operations together into 
transactions to ensure they would commit to disk as a unit.  With 
writeahead journaling, that's no longer necessary: on startup we just 
replay everything in the journal that isn't known to have committed to 
the fs, and any incorrectly duplicated operations are harmless (see the 
replay sketch after this list).  The only requirement is that operations 
are ordered, as with ext3's data=ordered.
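
The replay idea from that second point, again as a hypothetical sketch 
(the JournalEntry struct and apply_to_fs() helper are made up): anything 
past the fs's last committed sequence number gets reapplied, in order, 
and reapplying something that already made it to the fs does no harm.

  // Hypothetical sketch of writeahead replay on startup.
  #include <cstdint>
  #include <vector>

  struct JournalEntry { uint64_t seq; /* encoded transaction */ };

  void apply_to_fs(const JournalEntry&) { /* decode and apply */ }

  void replay_journal(const std::vector<JournalEntry>& entries,
                      uint64_t fs_committed_seq) {
    for (const auto& e : entries) {
      if (e.seq <= fs_committed_seq)
        continue;          // already known to be on the fs
      apply_to_fs(e);      // may duplicate work, but never reorders it
    }
  }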

That said, btrfs is still preferred.  Ceph makes heavy use of xattrs, and 
many objects are small; given that workload, the btrfs 
everything-in-a-btree approach should be a big win.  We also hook into 
btrfs to cheaply clone objects without copying actual data, which makes 
Ceph snapshots perform better.  (It'll still work with other file systems, 
but the object copy will be an actual copy, and slow.)
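
For the curious, the btrfs hook in question looks roughly like the 
standard clone ioctl below.  This is just an illustration (the actual 
cosd code path is different, and on older systems the BTRFS_IOC_CLONE 
definition may live in a different header):

  // Clone src's extents into dst without copying data (btrfs only).
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/btrfs.h>   // BTRFS_IOC_CLONE

  // Returns 0 on success, -1 with errno set on failure (e.g. when the
  // underlying fs doesn't support the ioctl).
  int clone_object(const char *src, const char *dst) {
    int sfd = open(src, O_RDONLY);
    int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int r = -1;
    if (sfd >= 0 && dfd >= 0)
      r = ioctl(dfd, BTRFS_IOC_CLONE, sfd);   // share extents, no data copy
    if (sfd >= 0) close(sfd);
    if (dfd >= 0) close(dfd);
    return r;
  }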

Using btrfs also means you can operate without a journal, which will work 
fine in cases where low latency writes aren't a requirement.
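
Configuration-wise, it ends up looking something like the ceph.conf 
fragment below (the exact option names may still change before v0.20, so 
treat this as a sketch): point the osd at a journal file or device, and 
pick a journaling mode.

  [osd]
      ; journal can be a file or a raw block device
      osd journal = /dev/sdb1
      osd journal size = 1000          ; MB, only needed for a file journal
      ; pick a journaling mode
      filestore journal writeahead = true
      ;filestore journal parallel = true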

Anyway, these changes will get lots of testing in unstable over the next 
few weeks and will go into the next release (v0.20).

sage
