On Tue, 20 Oct 2015, Haomai Wang wrote:
> On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil <[email protected]> wrote:
> > The current design is based on two simple ideas:
> >
> > 1) a key/value interface is a better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> > 2) a file system is well suited for storing object data (as files).
> >
> > So far #1 is working out well, but I'm questioning the wisdom of #2. A few
> > things:
> >
> > - We currently write the data to the file, fsync, then commit the kv
> > transaction. That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3). So two people are managing
> > metadata here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> >
> > - On read we have to open files by name, which means traversing the fs
> > namespace. Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups. We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> >
> > - ...and file systems insist on updating mtime on writes, even when it is
> > an overwrite with no allocation changes. (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> >
> > - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
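To be concrete about the open-by-handle point above: the calls themselves
are simple enough (rough, untested sketch below, error handling trimmed);
the catch is that open_by_handle_at() wants CAP_DAC_READ_SEARCH, which we
don't have when the daemon runs as ceph.

#include <fcntl.h>   // name_to_handle_at, open_by_handle_at, MAX_HANDLE_SZ
#include <cstdlib>

// Resolve a handle once (say, at object creation) and stash it in the kv
// store; later opens skip the namespace walk entirely.
struct file_handle *get_handle(const char *path)
{
  struct file_handle *fh =
      (struct file_handle *)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
  fh->handle_bytes = MAX_HANDLE_SZ;
  int mount_id;
  if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
    free(fh);
    return nullptr;
  }
  return fh;
}

int reopen(int mount_fd, struct file_handle *fh)
{
  // needs CAP_DAC_READ_SEARCH; one btree traversal instead of a name walk
  return open_by_handle_at(mount_fd, fh, O_RDONLY);
}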
> >
> > But what's the alternative? My thought is to just bite the bullet and
> > consume a raw block device directly. Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
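To be a bit more concrete about "write an allocator": my starting point
would be something as dumb as a sorted free-extent map, with the free list
persisted as kv entries so it commits in the same transaction as the rest
of the metadata. A rough sketch (class and method names made up):

#include <cstdint>
#include <map>

// First-fit over a map of free extents (offset -> length).  The real thing
// would also merge on release and persist changes through the kv txn.
class SimpleAllocator {
  std::map<uint64_t, uint64_t> free_;            // offset -> length
public:
  void init_add_free(uint64_t off, uint64_t len) { free_[off] = len; }

  bool allocate(uint64_t want, uint64_t *off) {
    for (auto p = free_.begin(); p != free_.end(); ++p) {
      if (p->second >= want) {
        *off = p->first;
        uint64_t noff = p->first + want;
        uint64_t nlen = p->second - want;
        free_.erase(p);
        if (nlen)
          free_[noff] = nlen;
        return true;
      }
    }
    return false;                                // ENOSPC for the caller
  }

  void release(uint64_t off, uint64_t len) {
    free_[off] = len;                            // no neighbor merge here
  }
};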
>
> This is really a tough decision, although a block device based
> objectstore has never left my mind over the past two years.
>
> We would be much more concerned about space-utilization efficiency
> compared to a local fs, the bugs, and the time it would take to build a
> tiny local filesystem. I'm a little afraid of what we could get stuck
> in....
>
> >
> > Wins:
> >
> > - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before). For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
>
> Compared to FileJournal, it seems the keyvaluedb doesn't play well in the
> WAL area, based on my perf results.
With this change it is close to parity:
https://github.com/facebook/rocksdb/pull/746
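i.e. a small overwrite turns into a single synchronous WriteBatch (data
blob plus onode update together), and the block-device overwrite happens
later, off the commit path. Roughly (sketch only, key names made up):

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cstdint>
#include <string>

// Queue a small overwrite in the kv WAL together with the metadata update;
// the actual overwrite of the object's extent is applied asynchronously
// and the wal/ key is deleted in a later transaction.
void queue_overwrite(rocksdb::DB *db, const std::string &oid,
                     uint64_t offset, const std::string &data,
                     const std::string &new_onode)
{
  rocksdb::WriteBatch batch;
  batch.Put("wal/" + oid + "/" + std::to_string(offset), data);
  batch.Put("onode/" + oid, new_onode);

  rocksdb::WriteOptions opts;
  opts.sync = true;                 // the one IO we pay on the commit path
  db->Write(opts, &batch);
}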
> > - No concern about mtime getting in the way
> >
> > - Faster reads (no fs lookup)
> >
> > - Similarly sized metadata for most objects. If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> > - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage. Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter. But what happens when we are storing gobs of
> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> > - We have to write and maintain an allocator. I'm still optimistic this
> > can be reasonably simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonably
> > sized). For disk we may need to be moderately clever.
> >
> > - We'll need a fsck to ensure our internal metadata is consistent. The
> > good news is it'll just need to validate what we have stored in the kv
> > store.
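Concretely, fsck is just "walk the kv store and check that it is
self-consistent": every extent must sit inside the device and must not
overlap any other. Something like this (sketch; the extent key layout and
decode_extent() are made up):

#include <rocksdb/db.h>
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>

// decode_extent() is a stand-in for however we end up encoding extent
// records; it just has to yield an offset and a length.
bool decode_extent(const rocksdb::Slice &v, uint64_t *off, uint64_t *len);

bool fsck_extents(rocksdb::DB *db, uint64_t device_size)
{
  std::map<uint64_t, uint64_t> seen;             // offset -> length
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek("extent/");
       it->Valid() && it->key().starts_with("extent/"); it->Next()) {
    uint64_t off, len;
    if (!decode_extent(it->value(), &off, &len))
      return false;                              // corrupt record
    if (off + len > device_size)
      return false;                              // points past the device
    auto next = seen.lower_bound(off);
    if (next != seen.end() && next->first < off + len)
      return false;                              // overlaps the next extent
    if (next != seen.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second > off)
        return false;                            // overlaps the previous one
    }
    seen[off] = len;
  }
  return true;
}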
> >
> > Other thoughts:
> >
> > - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> >
> > - Rocksdb can push colder data to a second directory, so we could have a
> > fast ssd primary area (for wal and most metadata) and a second hdd
> > directory for stuff it has to push off. Then have a conservative amount
> > of file space on the hdd. If our block fills up, use the existing file
> > mechanism to put data there too. (But then we have to maintain both the
> > current kv + file approach and not go all-in on kv + block.)
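fwiw the two-directory part is easy to wire up on the rocksdb side;
roughly (paths and sizes are placeholders):

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Keep the WAL and hot SSTs on flash and let rocksdb spill colder SSTs to
// the hdd path once the flash budget is spent.
rocksdb::DB *open_tiered_db()
{
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.wal_dir = "/ssd/db.wal";
  opts.db_paths.emplace_back("/ssd/db", 64ull << 30);     // ~64 GB on flash
  opts.db_paths.emplace_back("/hdd/db", 1024ull << 30);   // overflow on hdd

  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/ssd/db", &db);
  return s.ok() ? db : nullptr;
}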
>
> That sounds like a complex path...
>
> Actually, I would like to pursue a FileStore2 implementation, which means
> we still use FileJournal (or something like it). But we would need to use
> more memory to keep metadata/xattrs and use aio+dio to flush to disk; a
> userspace pagecache would need to be implemented. Then we could skip the
> journal for full-object writes: since the OSD isolates PGs, we could put
> a barrier on a single PG when skipping the journal. @Sage, are there any
> other concerns with FileStore skipping the journal?
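For reference, the aio+dio piece would look roughly like the sketch below
(libaio + O_DIRECT, error handling and alignment checks omitted); the
userspace page cache on top of it is the bigger piece of work.

#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

// One O_DIRECT write submitted through libaio; buffer, length, and offset
// must all be block-aligned for O_DIRECT to accept them.
int dio_write(const char *path, const char *data, size_t len, off_t off)
{
  int fd = open(path, O_WRONLY | O_DIRECT);
  if (fd < 0)
    return -1;

  void *buf = nullptr;
  posix_memalign(&buf, 4096, len);       // len assumed 4k-aligned here
  memcpy(buf, data, len);

  io_context_t ctx = 0;
  io_setup(1, &ctx);                     // queue depth 1 for the sketch

  struct iocb cb;
  struct iocb *cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, len, off);
  io_submit(ctx, 1, cbs);

  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, nullptr); // wait for the one completion

  io_destroy(ctx);
  free(buf);
  close(fd);
  return (long)ev.res == (long)len ? 0 : -1;
}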
>
> In short, I like the model that FileStore uses, but we would need a big
> refactor of the existing implementation.
>
> Sorry to interrupt the train of thought....
I think the directory (re)hashing strategy in filestore is too expensive,
and I don't see how it can be fixed without managing the namespace
ourselves (as newstore does).
If we want a middle-road approach where we still rely on a file system for
doing block allocation, then IMO the current incarnation of newstore is the
right path...
sage