Right, it just takes time to grow these things. Maybe the process could be accelerated by being more out there, but what do I know about marketing... not much :)
Dieter

On Tue, 18 Sep 2012 10:27:52 -0500, Mark Nelson <[email protected]> wrote:

> Hi Dieter,
>
> It sounds like some of those things will come with time (a more
> experienced community, docs, deployments, papers, etc.). Are there other
> things we could be doing that would make Ceph feel less risky for people
> doing similar comparisons?
>
> Thanks,
> Mark
>
> On 09/18/2012 10:19 AM, Plaetinck, Dieter wrote:
> > I don't mind. Ultimately it came down to Ceph vs. Swift for us.
> > Nothing is cast in stone yet, but we chose Swift for our new
> > (not-yet-production) cluster, because Swift has been around longer and
> > has more production deployments, and hence a bigger, more experienced
> > community, better documentation (official as well as unofficial: blogs,
> > tutorials, etc.) and more conferences/tech talks.
> >
> > It's also a simpler system that reuses more existing technology, which
> > makes it (a bit?) less efficient but easier to understand (HTTP
> > protocol vs. a custom protocol, cluster metadata in SQLite, Python,
> > which I'm more comfortable with than C, and so on).
> >
> > I would like to implement Ceph (because on paper it's just awesome),
> > but running it involves a certain uncertainty/risk I personally don't
> > want to take yet.
> >
> > Dieter
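To make the "reuses more existing technology" point above concrete: talking to Swift is plain HTTP, so a quick object upload needs nothing beyond a generic HTTP library. A minimal sketch in Python; the storage URL and token are hypothetical placeholders that the auth service would normally hand you:

# Upload one object to Swift with a plain authenticated HTTP PUT.
# STORAGE_URL and AUTH_TOKEN are hypothetical placeholders.
import requests

STORAGE_URL = "http://swift.example.com:8080/v1/AUTH_myaccount"
AUTH_TOKEN = "AUTH_tk0123456789abcdef"

def put_object(container, name, data):
    """PUT one object into a container; returns the HTTP status code."""
    resp = requests.put(
        "%s/%s/%s" % (STORAGE_URL, container, name),
        data=data,
        headers={"X-Auth-Token": AUTH_TOKEN,
                 "Content-Type": "application/octet-stream"},
    )
    return resp.status_code  # 201 Created on success

print(put_object("backups", "hello.txt", b"hello world"))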
> > On Tue, 18 Sep 2012 09:56:50 -0500, Mark Nelson <[email protected]> wrote:
> >
> > > Agreed, this was a really interesting writeup! Thanks John!
> > >
> > > Dieter, do you mind if I ask what is compelling for you in choosing
> > > Swift vs. the other options you've looked at, including Ceph?
> > >
> > > Thanks,
> > > Mark
> > >
> > > On 09/18/2012 09:51 AM, Plaetinck, Dieter wrote:
> > > > Thanks a lot for the detailed writeup, I found it quite useful.
> > > > The list of contestants is similar to the list I made when
> > > > researching (and I also had Luwak); while I also think Ceph is very
> > > > promising and probably deserves to dominate in the future, I'm
> > > > focusing on OpenStack Swift for now. FWIW
> > > >
> > > > Dieter
> > > >
> > > > On Tue, 18 Sep 2012 16:34:23 +0200, John Axel Eriksson <[email protected]> wrote:
> > > >
> > > > > I actually opted not to mention the specific product we had
> > > > > problems with, since there have been lots of changes and fixes to
> > > > > it which we unfortunately were unable to make use of (you'll know
> > > > > why later). But I guess it's interesting enough to go into a little
> > > > > more detail, so... before moving to Ceph we were using the Riak
> > > > > distributed database from Basho - http://riak.basho.com.
> > > > >
> > > > > First I have to say that Riak is actually pretty awesome in many
> > > > > ways, not least operations-wise. Compared to Ceph it's a lot easier
> > > > > to get up and running and to add storage as you go... basically
> > > > > just one command to add a node to the cluster, and all you need for
> > > > > that is the address of any existing node. With Riak, every node is
> > > > > the same, so there is no SPOF by default (e.g. no MDS, no MON -
> > > > > just nodes).
> > > > >
> > > > > As you might have thought already, "distributed database" isn't
> > > > > exactly the same as "distributed storage", so why did we use it?
> > > > > Well, there is an add-on to Riak called Luwak, also created and
> > > > > supported by Basho, that is touted as "large object support" and
> > > > > lets you store objects as large as you want. I think our main
> > > > > problem was with using this add-on (as I said, created and
> > > > > supported by Basho). An object in "standard" Riak K/V is limited
> > > > > to, I think, around 40 MB, or at least you shouldn't store larger
> > > > > objects than that because it means "trouble". Anyway, we went with
> > > > > Luwak, which seemed to be a perfect solution for the type of
> > > > > storage we do.
> > > > >
> > > > > We ran with Luwak for almost two years and usually it served us
> > > > > pretty well. Unfortunately there were bugs and hidden problems
> > > > > which IMO Basho should have been more open about. One issue is that
> > > > > Riak is based on a repair mechanism called "read-repair" - the name
> > > > > pretty much tells you how it works: data will only be repaired on a
> > > > > read. That is a problem in itself when you archive data, which we
> > > > > do (i.e. we read it very rarely, or not at all).
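An aside on the mechanism John describes: read-repair reconciles replicas only as a side effect of serving a read, which is exactly why rarely-read archive data never gets fixed. A rough illustrative sketch of the idea in Python - a toy model for intuition, not Riak's actual implementation:

# Toy model of read-repair (not Riak's code): stale or missing replicas
# are fixed only as a side effect of a read, so keys that are never read
# are never repaired.
import time

class Replica(object):
    """In-memory stand-in for one storage node."""
    def __init__(self):
        self.store = {}                      # key -> (timestamp, value)

    def get(self, key):
        return self.store.get(key)

    def put(self, key, stamped_value):
        self.store[key] = stamped_value

def write(replicas, key, value):
    """Write to all replicas; in real life some of these writes can fail."""
    stamped = (time.time(), value)
    for node in replicas:
        node.put(key, stamped)

def read_with_repair(replicas, key):
    """Return the newest copy of `key` and push it to stale/missing replicas."""
    copies = [(node, node.get(key)) for node in replicas]
    live = [c for _, c in copies if c is not None]
    if not live:
        raise KeyError(key)
    # A real system compares vector clocks; a timestamp is enough for a sketch.
    newest = max(live, key=lambda c: c[0])
    for node, c in copies:
        if c is None or c[0] < newest[0]:
            node.put(key, newest)            # the repair happens here, on read
    return newest[1]

if __name__ == "__main__":
    nodes = [Replica(), Replica(), Replica()]
    write(nodes, "photo-123", "...bytes...")
    nodes[2].store.clear()                   # simulate a lost replica
    read_with_repair(nodes, "photo-123")     # reading triggers the repair
    assert nodes[2].get("photo-123") is not None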
> > > > > With Luwak (the large-object add-on), data is split into many keys
> > > > > and values and stored in the "normal" Riak K/V store...
> > > > > unfortunately, read-repair in this scenario doesn't seem to work at
> > > > > all, and if something was missing, Riak had a tendency to crash
> > > > > HARD, sometimes managing to take the whole machine with it. There
> > > > > were also strange issues where one crashing node seemed to affect
> > > > > its neighbors so that they also crashed... a domino effect which
> > > > > makes "distributed" a little too "distributed". This didn't always
> > > > > happen, but it did happen several times in our case. The logs were
> > > > > often pretty hard to understand and more often than not left us
> > > > > completely in the dark about what was going on.
> > > > >
> > > > > We also discovered that deleting data in Luwak doesn't actually DO
> > > > > anything... sure, the key is gone, but the data is still on disk,
> > > > > seemingly orphaned, so deleting was more or less a no-op. This was
> > > > > nowhere to be found in the docs.
> > > > >
> > > > > Finally, I think on the 3rd of June this year, we requested paid
> > > > > support from Basho to help us in our last crash-and-burn situation,
> > > > > and that's when we were told, among other things, that DELETE only
> > > > > appears to work. We were also told that Luwak was originally
> > > > > created to store email and not really the types of things we store
> > > > > (i.e. files). This information wasn't available anywhere - Luwak
> > > > > simply had the wrong "table of contents" associated with it. All
> > > > > this was quite a turn-off for us. To Basho's credit, they really
> > > > > did help us fix our cluster, and they're really nice, friendly and
> > > > > helpful guys.
> > > > >
> > > > > Actually, I think the last straw was when Luwak was suddenly - out
> > > > > of nowhere, really - discontinued around the beginning of this
> > > > > year, probably because of the bugs and hidden problems that I think
> > > > > may have come from a less-than-stellar implementation of
> > > > > large-object support from the start... so by then we were on
> > > > > something completely unsupported. We couldn't switch to something
> > > > > else immediately, of course, but we started looking around for
> > > > > alternatives at that time. That's when I found Ceph, among other
> > > > > more or less distributed systems; the others were:
> > > > >
> > > > > Tahoe-LAFS      https://tahoe-lafs.org/trac/tahoe-lafs
> > > > > XtreemFS        http://www.xtreemfs.org
> > > > > HDFS            http://hadoop.apache.org/hdfs/
> > > > > GlusterFS       http://www.gluster.org
> > > > > PomegranateFS   https://github.com/macan/Pomegranate/wiki
> > > > > MooseFS         http://www.moosefs.org
> > > > > OpenStack Swift http://docs.openstack.org/developer/swift/
> > > > > MongoDB GridFS  http://www.mongodb.org/display/DOCS/GridFS
> > > > > LS4             http://ls4.sourceforge.net/
> > > > >
> > > > > After trying most of these I decided to look closer at a few of
> > > > > them - MooseFS, HDFS, XtreemFS and Ceph - since the others were
> > > > > either not really suited to our use case or just too complicated to
> > > > > set up and keep running (IMO). For a short while I dabbled in
> > > > > writing my own storage system using ZeroMQ for communication, but
> > > > > it's just not what our company does, so I gave that up pretty
> > > > > quickly :-). In the end I chose Ceph. Ceph wasn't as easy as
> > > > > Riak/Luwak operationally, but it was better in every other aspect
> > > > > and definitely a good fit. The RADOS Gateway (S3-compatible) was
> > > > > really a big thing for us as well.
> > > > >
> > > > > As I started out saying, there have been many improvements to Riak,
> > > > > not least to the large-object support... but that large-object
> > > > > support is not built on Luwak; it's a completely new thing, and
> > > > > it's not open source or free. It's called Riak CS (CS for Cloud
> > > > > Storage, I believe), it has an S3-compatible interface, and it
> > > > > seems to be pretty good. We had many internal discussions about
> > > > > whether Riak CS was the right move for us, but in the end we
> > > > > decided on Ceph since we couldn't justify the cost of Riak CS.
> > > > >
> > > > > To sum it up: we made, in retrospect, a bad choice - not because
> > > > > Riak itself doesn't work or isn't any good at the things it's good
> > > > > at (it really is!), but because the add-on Luwak was misrepresented
> > > > > and was not a good fit for us.
> > > > >
> > > > > I really have high hopes for Ceph and I think it has a bright
> > > > > future, both in our company and in general. Riak CS would probably
> > > > > have been a very good fit as well if it weren't for the cost
> > > > > involved.
> > > > >
> > > > > So there you have it - not just failure scenarios but bad
> > > > > decisions, misrepresentation of features and somewhat sparse
> > > > > documentation. By the way, Ceph has improved its docs a lot, but
> > > > > they could still use some work.
> > > > >
> > > > > -John
> > > > >
> > > > > On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <[email protected]> wrote:
> > > > > > On Tue, 18 Sep 2012 01:26:03 +0200, John Axel Eriksson <[email protected]> wrote:
> > > > > > > another distributed storage solution that had failed us more
> > > > > > > than once and we lost data. Since the old system had an http
> > > > > > > interface (not S3 compatible though)
> > > > > >
> > > > > > Can you say a bit more about this? Failure stories are very
> > > > > > interesting and useful.
> > > > > >
> > > > > > Dieter
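Since the S3-compatible RADOS Gateway was a deciding factor in John's writeup, here is a minimal sketch of what that compatibility means in practice: a stock S3 client (boto, in this case) pointed at a radosgw endpoint instead of Amazon. The host and credentials are hypothetical placeholders; on the Ceph side they would come from creating a gateway user (e.g. with radosgw-admin).

# Minimal sketch: a standard S3 client talking to the RADOS Gateway.
# Host and keys are hypothetical placeholders for a radosgw user.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="radosgw.example.com",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("backups")
key = bucket.new_key("hello.txt")
key.set_contents_from_string("hello world")
print(key.generate_url(3600, query_auth=True))   # temporary signed URL

Drop the host override and the same code talks to Amazon S3 - that interchangeability is what "S3 compatible" buys you.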
