Hi Sage,
Thanks for your comments, much appreciated.
On Tuesday, 27 August 2013 at 10:19:46, Sage Weil wrote:
> Hi Guido!
>
> On Tue, 27 Aug 2013, Guido Winkelmann wrote:
[...]
> > - There is no dynamic tiered storage, and there probably never will be, if
> > I understand the architecture correctly.
> > You can have different pools with different performance characteristics
> > (like one on cheap and large 7200 RPM disks, and another on SSDs), but
> > once you have put a given bunch of data on one pool, it is pretty much
> > stuck there. (I.e. you cannot move it to another pool without very tight
> > and very manual coordination with all clients using it.)
>
> This is a key item on the roadmap for Emperor (nov) and Firefly (feb).
> We are building two capabilities: 'cache pools' that let you put fast
> storage in front of your main data pool, and a tiered 'cold' pool that
> lets you bleed cold objects off to a cheaper, slower tier
Sounds interesting.
Will that work on entire PGs or on single objects? How do you keep track of
which object lives in which pool without resorting to a lookup step before
every operation? Will that feature retain backward compatibility with older
Ceph clients?
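From the blueprint discussions I've seen floating around, I'd guess the
eventual interface ends up looking something like the sketch below -- the
command names and pool names are pure speculation on my part, not anything
confirmed:

```shell
# Speculative cache-pool setup (commands guessed from blueprint discussion;
# the interface that actually ships in Emperor/Firefly may differ):
ceph osd pool create cold-data 128        # backing pool on 7200 RPM disks
ceph osd pool create hot-cache 128        # cache pool on SSDs
ceph osd tier add cold-data hot-cache     # put the cache in front of cold-data
ceph osd tier cache-mode hot-cache writeback
ceph osd tier set-overlay cold-data hot-cache
```

If it works roughly like that, clients would keep talking to the base pool
and the overlay would redirect hot objects transparently, which would answer
my backward-compatibility question.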
> (probably using erasure coding.. which is also coming in firefly).
... which happens to address another issue I forgot to mention.
> > - There is no active data deduplication, and, again, if I understand the
> > architecture correctly, there probably never will be.
> > There is, however, sparse allocation and COW-cloning for RBD volumes,
> > which does something similar. Under certain conditions, it is even
> > possible to use the discard option of modern filesystems to automatically
> > keep unused regions of an RBD volume sparse.
>
> You can do two things:
>
> - Do dedup inside an osd. Btrfs is growing this capability, and ZFS
> already has it. This is not ideal because data is randomly distributed
> across nodes.
>
> - You can build dedup on top of rados, for example by naming objects after
> a hash of their content. This will never be a 'magic and transparent
> dedup for all rados apps' because CAS is based on naming objects from
> content, and rados fundamentally places data based on name and eschews
> metadata. That means there isn't normally a way to point to the content
> unless there is some MDS on top of rados. Someday CephFS will get this,
> but raw librados users and RBD won't get it for free.
I read that as TL;DR: No real deduplication.
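That said, the "naming objects after a hash of their content" scheme Sage
describes is simple enough to sketch. Everything past the hashing step below
is my own guess at how one might use it, not actual rados usage:

```shell
# Content-addressed naming: derive the object name from the SHA-256 of the
# payload, so identical payloads map to the same object name and dedup
# falls out of that naturally.
payload='hello world'
objname=$(printf '%s' "$payload" | sha256sum | awk '{print $1}')
echo "$objname"
# -> b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

# A writer would then store under that name, e.g. (hypothetical usage):
#   rados -p mypool put "$objname" /path/to/payload
# and, as Sage says, some layer on top of rados would have to keep the
# name-to-content index and reference counts -- rados itself won't.
```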
> > - Bad support for multiple customers accessing the same cluster.
> > This is assuming that, if you have multiple customers, it is imperative
> > that any one given customer must be unable to access or even modify the
> > data of any other customer. You can have authorization on the pool layer,
> > but it has been reported that Ceph reacts badly to defining a large
> > number of pools. Multi-customer support in CephFS is non-existent.
> > RadosGW probably supports multi-customer, but I haven't tried it.
>
> The just-released Dumpling included support for rados namespaces, which
> are designed to address exactly this issue. Namespaces exist "inside"
> pools, and the auth capabilities can restrict access to a specific
> namespace.
I'm having some trouble finding this in the documentation. Can you give me a
pointer here?
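For anyone who finds the docs before I do: from what I've pieced together
out of the Dumpling release notes, I'd expect the cap syntax to look
something like this (unverified on my end, client and pool names made up):

```shell
# Restrict one customer to their own namespace inside a shared pool
# (syntax as I understand it from the Dumpling release notes; unverified):
ceph auth get-or-create client.customer1 \
    mon 'allow r' \
    osd 'allow rw pool=data namespace=customer1'
```

If that's right, it would avoid the "one pool per customer" scaling problem
entirely, since namespaces are cheap compared to pools.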
> > - No dynamic partitioning for CephFS
> > The original paper talked about dynamic partitioning of the CephFS
> > namespace, so that multiple Metadata Servers could share the workload of
> > a large number of CephFS clients. This isn't implemented yet (or
> > implemented but not working properly?), and the only currently supported
> > multi-MDS configuration is 1 active / n standby. This limits the
> > scalability of CephFS. It looks to me like CephFS is not a major focus of
> > the development team at this time.
>
> This has been implemented since ~2006. We do not recommend it for
> production because it has not had the QA attention it deserves. That
> said, Zheng Yan has been doing a lot of great work here recently and
> things have improved considerably. Please try it! You just need to do
> 'ceph mds set_max_mds 3' (or whatever) to tell ceph how many active
> ceph-mds daemons you want.
Okay, I think I will try this.
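For the archives, this is what I plan to run -- the set_max_mds command is
the one you gave; the stat check afterwards is my own addition and assumes
the usual `ceph mds stat` output:

```shell
# Tell the cluster how many active ceph-mds daemons to run (needs a live
# cluster with at least 3 mds daemons started):
ceph mds set_max_mds 3
ceph mds stat    # should eventually report 3 MDS daemons as up:active
```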
Guido
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com