On Tue, 20 Oct 2015, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> > 
> >   1) a key/value interface is a better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> > 
> >   2) a file system is well suited for storing object data (as files).
> > 
> > So far #1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> > 
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS
> implementation of course, but the order of issuing those fsync()'s can
> effectively make some of them no-ops.

Sure, yes, but the fact remains we are maintaining two journals: one 
internal to the fs that manages the allocation metadata, and one layered 
on top that handles the kv store's write stream.  The lower bound on any 
write is 3 IOs (unless we're talking about a COW fs).
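
To make that concrete, the per-write sequence is roughly the following 
(an illustrative sketch only, not the actual newstore code; kv_commit() 
stands in for the kv backend's synchronous transaction commit):

    #include <fcntl.h>
    #include <unistd.h>

    extern void kv_commit();  // stand-in for the kv store's sync txn commit

    void write_and_commit(int fd, const char *buf, size_t len, off_t off)
    {
        ::pwrite(fd, buf, len, off);  // IO 1: the object data itself
        ::fsync(fd);                  // IO 2: the fs journals its own
                                      //       allocation/inode metadata
        kv_commit();                  // IO 3: our metadata txn (onode,
                                      //       attrs, wal, ...)
    }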

> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a pretty low hurdle to overcome.

I wish you luck convincing upstream to allow unprivileged access to 
open_by_handle or the XFS ioctl.  :)  But even if we had that, any object 
access requires multiple metadata lookups: one in our kv db, and a second 
to get the inode for the backing file.  Again, that puts an unnecessarily 
high lower bound on the number of IOs needed to access a cold object.
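
For reference, the open-by-handle path we'd like looks roughly like this 
(the helpers are illustrative, not existing code); the catch is that 
open_by_handle_at() wants CAP_DAC_READ_SEARCH, which an unprivileged 
ceph daemon doesn't have:

    #include <fcntl.h>
    #include <stdlib.h>

    // Resolve the name once and stash the handle next to the onode in
    // the kv store...
    struct file_handle *get_handle(int dirfd, const char *name)
    {
        struct file_handle *fh =
            (struct file_handle *)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        fh->handle_bytes = MAX_HANDLE_SZ;
        int mount_id;
        if (name_to_handle_at(dirfd, name, fh, &mount_id, 0) < 0) {
            free(fh);
            return NULL;
        }
        return fh;
    }

    // ...and later reopen with a single btree traversal instead of a
    // namespace walk.
    int open_handle(int mount_fd, struct file_handle *fh)
    {
        return open_by_handle_at(mount_fd, fh, O_RDWR);
    }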

> >   - ...and file systems insist on updating mtime on writes, even when it is
> > an overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database
> tricks that we can use here.

It's not about the data path, but about avoiding the useless bookkeeping 
the file system is doing that we don't want or need.  See the recent 
reception of Zach's O_NOCMTIME patches on linux-fsdevel:

        http://marc.info/?t=143094969800001&r=1&w=2

I'm generally an optimist when it comes to introducing new APIs upstream, 
but I still found this to be an unbelievably frustrating exchange.
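
Purely for illustration, this is all we were asking for (O_NOCMTIME is 
the flag from Zach's proposed patches, never merged into mainline, hence 
the #ifdef):

    #include <fcntl.h>

    int open_object(const char *path)
    {
    #ifdef O_NOCMTIME
        // proposed: don't dirty the inode just to bump mtime on overwrites
        return ::open(path, O_RDWR | O_NOCMTIME);
    #else
        // mainline today: every overwrite also forces an mtime update
        return ::open(path, O_RDWR);
    #endif
    }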

> >   - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).

Not if we keep the checksums with the allocation metadata, in the 
onode/inode, which we're already doing IO to persist.  But whether that 
is practical depends on the granularity (4KB or 16K or 128K or ...), 
which may in turn depend on the object (an RBD block that'll service 
random 4K reads and writes?  or an RGW fragment that is always written 
sequentially?).  I'm highly skeptical we'd ever get anything from a 
general-purpose file system that would work well here (if anything at all).

> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in the kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from running
> on raw block devices in favor of file systems over time.  In effect, you are
> looking at making a simple on-disk file system, which is always easier to
> start than it is to bring to a stable, production-ready state.

This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had 
everything we were implementing and more: mainly, copy on write and data 
checksums.  But in practice the fact that it is general purpose means it 
targets very different workloads and APIs than what we need.

Now that I've realized the POSIX file namespace is a bad fit for what we 
need and opted to manage that directly, things are vastly simpler: we no 
longer have the horrific directory hashing tricks to allow PG splits (not 
because we are scared of big directories but because we need ordered 
enumeration of objects) and the transactions have exactly the granularity 
we want.  In fact, it turns out that pretty much the *only* thing the file 
system provides that we need is block allocation; everything else is 
overhead we have to play tricks to work around (batched fsync, O_NOCMTIME, 
open by handle), or something that we want but the fs will likely never 
provide (like checksums).
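
To illustrate why the kv store makes the enumeration problem go away 
(illustrative only; std::map stands in for the kv backend's ordered 
iterator): if object keys sort as <collection>.<hash>.<name>, listing 
or splitting a PG is just a range scan:

    #include <map>
    #include <string>
    #include <vector>

    std::vector<std::string>
    list_range(const std::map<std::string, std::string> &kv,
               const std::string &start, const std::string &end)
    {
        std::vector<std::string> out;
        for (auto it = kv.lower_bound(start);
             it != kv.end() && it->first < end; ++it)
            out.push_back(it->first);   // already in enumeration order
        return out;
    }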

> I think that it might be quicker and more maintainable to spend some time
> working with the local file system people (XFS or other) to see if we can
> jointly address the concerns you have.

I have been, in cases where what we want is something that makes sense for 
other file system users.  But mostly I think that the problem is more 
that what we want isn't a file system, but an allocator + block device.

And the end result is that slotting a file system into the stack puts an 
upper bound on our performance.  On its face this isn't surprising, but 
I'm running up against it in gory detail in my efforts to make the Ceph 
OSD faster, and the question becomes whether we want to be fast or 
layered.  (I don't think 'simple' is really an option given the effort to 
work around the POSIX vs ObjectStore impedance mismatch.)

> I really hate the idea of making a new file system type (even if we call it a
> raw block store!).

Just to be clear, this isn't a new kernel file system--it's userland 
consuming a block device (a la Oracle).  (But yeah, I hate it too.)

> In addition to the technical hurdles, there are also production worries like
> how long will it take for distros to pick up formal support?  How do we test
> it properly?

This actually means less for the distros to support: we'll consume 
/dev/sdb instead of an XFS mount.  Testing will be the same as before... 
the usual forced-kill and power cycle testing under the stress and 
correctness testing workloads.

What we (Ceph) will support in its place is a combination of a kv store 
(which we already need) and a block allocator.
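
A rough sketch (again illustrative, not a design) of the allocator 
interface that would sit next to the kv store:

    #include <cstdint>
    #include <vector>

    struct pextent_t {
        uint64_t offset;   // byte offset on the raw device
        uint64_t length;
    };

    class Allocator {
    public:
        virtual ~Allocator() {}
        // reserve up to 'want' bytes in alloc_unit-sized chunks and
        // return the extents granted
        virtual int allocate(uint64_t want, uint64_t alloc_unit,
                             std::vector<pextent_t> *extents) = 0;
        // return extents to the free map; the free-map update commits
        // in the same kv transaction as the metadata change
        virtual void release(const std::vector<pextent_t> &extents) = 0;
    };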

sage