Right, it just takes time to grow these things. Maybe the process could be accelerated by being more out there, but what do I know about marketing... not much :)
Dieter

On Tue, 18 Sep 2012 10:27:52 -0500, Mark Nelson <[email protected]> wrote:

> Hi Dieter,
>
> It sounds like some of those things will come with time (a more
> experienced community, docs, deployments, papers, etc.). Are there other
> things we could be doing that would make Ceph feel less risky for people
> doing similar comparisons?
>
> Thanks,
> Mark
>
> On 09/18/2012 10:19 AM, Plaetinck, Dieter wrote:
> > I don't mind. Ultimately it came down to Ceph vs. Swift for us.
> > Nothing is cast in stone yet, but we chose Swift for our new
> > (not-yet-production) cluster, because Swift has been around longer and
> > has more production deployments, and hence a bigger, more experienced
> > community, better documentation (official as well as unofficial: blogs,
> > tutorials, etc.) and more conferences/tech talks.
> >
> > It's also a simpler system that reuses more existing technology, which
> > makes it (a bit?) less efficient but easier to understand (HTTP
> > protocol vs. a custom protocol, cluster metadata in SQLite, Python,
> > which I'm more comfortable with than C, and so on).
> >
> > I would like to implement Ceph (because on paper it's just awesome),
> > but running it involves a certain uncertainty/risk I personally don't
> > want to take yet.
> >
> > Dieter
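To make the "reuses more existing technology" point above concrete: talking to Swift is plain HTTP, so a quick object upload needs nothing beyond a generic HTTP library. A minimal sketch in Python; the storage URL and token are hypothetical placeholders that the auth service would normally hand you:

# Upload one object to Swift with a plain authenticated HTTP PUT.
# STORAGE_URL and AUTH_TOKEN are hypothetical placeholders.
import requests

STORAGE_URL = "http://swift.example.com:8080/v1/AUTH_myaccount"
AUTH_TOKEN = "AUTH_tk0123456789abcdef"

def put_object(container, name, data):
    """PUT one object into a container; returns the HTTP status code."""
    resp = requests.put(
        "%s/%s/%s" % (STORAGE_URL, container, name),
        data=data,
        headers={"X-Auth-Token": AUTH_TOKEN,
                 "Content-Type": "application/octet-stream"},
    )
    return resp.status_code  # 201 Created on success

print(put_object("backups", "hello.txt", b"hello world"))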
> > On Tue, 18 Sep 2012 09:56:50 -0500, Mark Nelson <[email protected]> wrote:
> >
> > > Agreed, this was a really interesting writeup! Thanks John!
> > >
> > > Dieter, do you mind if I ask what is compelling for you in choosing
> > > Swift vs. the other options you've looked at, including Ceph?
> > >
> > > Thanks,
> > > Mark
> > >
> > > On 09/18/2012 09:51 AM, Plaetinck, Dieter wrote:
> > > > Thanks a lot for the detailed writeup, I found it quite useful.
> > > > The list of contestants is similar to the list I made when
> > > > researching (and I also had Luwak); while I also think Ceph is very
> > > > promising and probably deserves to dominate in the future, I'm
> > > > focusing on OpenStack Swift for now. FWIW
> > > >
> > > > Dieter
> > > >
> > > > On Tue, 18 Sep 2012 16:34:23 +0200, John Axel Eriksson <[email protected]> wrote:
> > > >
> > > > > I actually opted not to mention the specific product we had
> > > > > problems with, since there have been lots of changes and fixes to
> > > > > it which we unfortunately were unable to make use of (you'll know
> > > > > why later). But I guess it's interesting enough to go into a little
> > > > > more detail, so... before moving to Ceph we were using the Riak
> > > > > distributed database from Basho - http://riak.basho.com.
> > > > >
> > > > > First I have to say that Riak is actually pretty awesome in many
> > > > > ways, not least operations-wise. Compared to Ceph it's a lot easier
> > > > > to get up and running and to add storage as you go... basically
> > > > > just one command to add a node to the cluster, and all you need for
> > > > > that is the address of any existing node. With Riak, every node is
> > > > > the same, so there is no SPOF by default (e.g. no MDS, no MON -
> > > > > just nodes).
> > > > >
> > > > > As you might have thought already, "distributed database" isn't
> > > > > exactly the same as "distributed storage", so why did we use it?
> > > > > Well, there is an add-on to Riak called Luwak, also created and
> > > > > supported by Basho, that is touted as "large object support" and
> > > > > lets you store objects as large as you want. I think our main
> > > > > problem was with using this add-on (as I said, created and
> > > > > supported by Basho). An object in "standard" Riak K/V is limited
> > > > > to, I think, around 40 MB, or at least you shouldn't store larger
> > > > > objects than that because it means "trouble". Anyway, we went with
> > > > > Luwak, which seemed to be a perfect solution for the type of
> > > > > storage we do.
> > > > >
> > > > > We ran with Luwak for almost two years and usually it served us
> > > > > pretty well. Unfortunately there were bugs and hidden problems
> > > > > which IMO Basho should have been more open about. One issue is that
> > > > > Riak is based on a repair mechanism called "read-repair" - the name
> > > > > pretty much tells you how it works: data will only be repaired on a
> > > > > read. That is a problem in itself when you archive data, which we
> > > > > do (i.e. we read it very rarely, or not at all).
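An aside on the mechanism John describes: read-repair reconciles replicas only as a side effect of serving a read, which is exactly why rarely-read archive data never gets fixed. A rough illustrative sketch of the idea in Python - a toy model for intuition, not Riak's actual implementation:

# Toy model of read-repair (not Riak's code): stale or missing replicas
# are fixed only as a side effect of a read, so keys that are never read
# are never repaired.
import time

class Replica(object):
    """In-memory stand-in for one storage node."""
    def __init__(self):
        self.store = {}                      # key -> (timestamp, value)

    def get(self, key):
        return self.store.get(key)

    def put(self, key, stamped_value):
        self.store[key] = stamped_value

def write(replicas, key, value):
    """Write to all replicas; in real life some of these writes can fail."""
    stamped = (time.time(), value)
    for node in replicas:
        node.put(key, stamped)

def read_with_repair(replicas, key):
    """Return the newest copy of `key` and push it to stale/missing replicas."""
    copies = [(node, node.get(key)) for node in replicas]
    live = [c for _, c in copies if c is not None]
    if not live:
        raise KeyError(key)
    # A real system compares vector clocks; a timestamp is enough for a sketch.
    newest = max(live, key=lambda c: c[0])
    for node, c in copies:
        if c is None or c[0] < newest[0]:
            node.put(key, newest)            # the repair happens here, on read
    return newest[1]

if __name__ == "__main__":
    nodes = [Replica(), Replica(), Replica()]
    write(nodes, "photo-123", "...bytes...")
    nodes[2].store.clear()                   # simulate a lost replica
    read_with_repair(nodes, "photo-123")     # reading triggers the repair
    assert nodes[2].get("photo-123") is not None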
> > > > > With Luwak (the large-object add-on), data is split into many keys
> > > > > and values and stored in the "normal" Riak K/V store...
> > > > > unfortunately, read-repair in this scenario doesn't seem to work at
> > > > > all, and if something was missing, Riak had a tendency to crash
> > > > > HARD, sometimes managing to take the whole machine with it. There
> > > > > were also strange issues where one crashing node seemed to affect
> > > > > its neighbors so that they also crashed... a domino effect which
> > > > > makes "distributed" a little too "distributed". This didn't always
> > > > > happen, but it did happen several times in our case. The logs were
> > > > > often pretty hard to understand and more often than not left us
> > > > > completely in the dark about what was going on.
> > > > >
> > > > > We also discovered that deleting data in Luwak doesn't actually DO
> > > > > anything... sure, the key is gone, but the data is still on disk,
> > > > > seemingly orphaned, so deleting was more or less a no-op. This was
> > > > > nowhere to be found in the docs.
> > > > >
> > > > > Finally, I think on the 3rd of June this year, we requested paid
> > > > > support from Basho to help us in our last crash-and-burn situation,
> > > > > and that's when we were told, among other things, that DELETE only
> > > > > appears to work. We were also told that Luwak was originally
> > > > > created to store email and not really the types of things we store
> > > > > (i.e. files). This information wasn't available anywhere - Luwak
> > > > > simply had the wrong "table of contents" associated with it. All
> > > > > this was quite a turn-off for us. To Basho's credit, they really
> > > > > did help us fix our cluster, and they're really nice, friendly and
> > > > > helpful guys.
> > > > >
> > > > > Actually, I think the last straw was when Luwak was suddenly - out
> > > > > of nowhere, really - discontinued around the beginning of this
> > > > > year, probably because of the bugs and hidden problems that I think
> > > > > may have come from a less-than-stellar implementation of
> > > > > large-object support from the start... so by then we were on
> > > > > something completely unsupported. We couldn't switch to something
> > > > > else immediately, of course, but we started looking around for
> > > > > alternatives at that time. That's when I found Ceph, among other
> > > > > more or less distributed systems; the others were:
> > > > >
> > > > > Tahoe-LAFS      https://tahoe-lafs.org/trac/tahoe-lafs
> > > > > XtreemFS        http://www.xtreemfs.org
> > > > > HDFS            http://hadoop.apache.org/hdfs/
> > > > > GlusterFS       http://www.gluster.org
> > > > > PomegranateFS   https://github.com/macan/Pomegranate/wiki
> > > > > MooseFS         http://www.moosefs.org
> > > > > OpenStack Swift http://docs.openstack.org/developer/swift/
> > > > > MongoDB GridFS  http://www.mongodb.org/display/DOCS/GridFS
> > > > > LS4             http://ls4.sourceforge.net/
> > > > >
> > > > > After trying most of these I decided to look closer at a few of
> > > > > them - MooseFS, HDFS, XtreemFS and Ceph - since the others were
> > > > > either not really suited to our use case or just too complicated to
> > > > > set up and keep running (IMO). For a short while I dabbled in
> > > > > writing my own storage system using ZeroMQ for communication, but
> > > > > it's just not what our company does, so I gave that up pretty
> > > > > quickly :-). In the end I chose Ceph. Ceph wasn't as easy as
> > > > > Riak/Luwak operationally, but it was better in every other aspect
> > > > > and definitely a good fit. The RADOS Gateway (S3-compatible) was
> > > > > really a big thing for us as well.
> > > > >
> > > > > As I started out saying, there have been many improvements to Riak,
> > > > > not least to the large-object support... but that large-object
> > > > > support is not built on Luwak; it's a completely new thing, and
> > > > > it's not open source or free. It's called Riak CS (CS for Cloud
> > > > > Storage, I believe), it has an S3-compatible interface, and it
> > > > > seems to be pretty good. We had many internal discussions about
> > > > > whether Riak CS was the right move for us, but in the end we
> > > > > decided on Ceph since we couldn't justify the cost of Riak CS.
> > > > >
> > > > > To sum it up: we made, in retrospect, a bad choice - not because
> > > > > Riak itself doesn't work or isn't any good at the things it's good
> > > > > at (it really is!), but because the add-on Luwak was misrepresented
> > > > > and was not a good fit for us.
> > > > >
> > > > > I really have high hopes for Ceph and I think it has a bright
> > > > > future, both in our company and in general. Riak CS would probably
> > > > > have been a very good fit as well if it weren't for the cost
> > > > > involved.
> > > > >
> > > > > So there you have it - not just failure scenarios but bad
> > > > > decisions, misrepresentation of features and somewhat sparse
> > > > > documentation. By the way, Ceph has improved its docs a lot, but
> > > > > they could still use some work.
> > > > >
> > > > > -John
> > > > >
> > > > > On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <[email protected]> wrote:
> > > > > > On Tue, 18 Sep 2012 01:26:03 +0200, John Axel Eriksson <[email protected]> wrote:
> > > > > > > another distributed storage solution that had failed us more
> > > > > > > than once and we lost data. Since the old system had an http
> > > > > > > interface (not S3 compatible though)
> > > > > >
> > > > > > Can you say a bit more about this? Failure stories are very
> > > > > > interesting and useful.
> > > > > >
> > > > > > Dieter
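Since the S3-compatible RADOS Gateway was a deciding factor in John's writeup, here is a minimal sketch of what that compatibility means in practice: a stock S3 client (boto, in this case) pointed at a radosgw endpoint instead of Amazon. The host and credentials are hypothetical placeholders; on the Ceph side they would come from creating a gateway user (e.g. with radosgw-admin).

# Minimal sketch: a standard S3 client talking to the RADOS Gateway.
# Host and keys are hypothetical placeholders for a radosgw user.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="radosgw.example.com",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("backups")
key = bucket.new_key("hello.txt")
key.set_contents_from_string("hello world")
print(key.generate_url(3600, query_auth=True))   # temporary signed URL

Drop the host override and the same code talks to Amazon S3 - that interchangeability is what "S3 compatible" buys you.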
