First, let me say I love SQL and it has treated me well for many years. The problem I see is that SQL is only a partial solution, right? Can we eliminate Solr? What about NFS? What will we do about analytics? As far as I can tell, Cassandra is the only technology that allows us to solve a very complex set of divergent problems with a single solution. So the discussion I think we are having is not so much SQL vs NoSQL as complexity vs simplicity.
FYI - I have done some more research into when real-world implementers choose Cassandra over SQL. I will try to capture the essence of the talking points below and provide some local context where possible:

If you are not using big data to your advantage, your competitor is! Blackboard’s xPlor announcement last week should send a very clear shot across our bow (i.e. MongoDB + ElasticSearch + LTI Provider). To say this very clearly again - we cannot be chasing last year’s LMS - if we are not evolving we are dying - we must make a strategic decision for the future of our application. We simply cannot let the industry outpace us tomorrow by making a poor selection today.

High-velocity writes; i.e. do you have a large amount of data coming into your application in a short amount of time? I imagine our activity stream requirements could fall into this category and I am certain others will as well. Once we have this capability, imagine the kinds of uses we might find for it: e.g. learning analytics, click stream analysis, log file storage and analysis, etc. Being able to act on big data / fast data is a key differentiator in today’s marketplace and something we should factor into our decision-making process.

Large volume of data; i.e. terabytes or petabytes. I would imagine a large or multi-tenant implementation easily getting to these kinds of numbers within a year or so (especially as storage needs are increasing dramatically).

A mixture of data types; i.e. structured, semi-structured, and unstructured. We have all three! OAE falls into this case with our mixture of user-generated content, activity streams, and more traditional data.

Naturally ordered data; e.g. time-based ordering. Cassandra provides a very compelling solution here, as it has built-in support for returning ranges of rows based on their natural ordering (e.g. activity streams, inboxes, etc).

Deployment complexity; i.e. do you need your data to be distributed across geographic zones and/or regions to support continuous availability or disaster recovery scenarios? Most institutions will share the common DR concern and some will also be concerned about continuous availability; i.e. when availability zones bomb (e.g. Amazon East). Cassandra already has built-in support for rack-aware and data-center-aware replica placement. Cassandra also has no single point of failure - all nodes are equal.

Do you need true ACID properties? Cassandra provides “AID”; consistency is tunable per read or write operation, so each request can trade consistency against availability and latency (the CAP theorem trade-off).

Manually sharding relational databases is very HARD and takes a TON of investment to get right. Expanding the capacity of a live relational database while it stays online is also very difficult. These are real operational concerns, and there are many examples where this kind of growing pain is not easy to overcome and can break a project.

No need for a complex memcached cluster, or for caches getting out of sync. FYI - memcached is usually one of the first solutions employed to address scaling issues with a relational DB. Cassandra natively provides memcached-like behavior straight out of the box with its caching strategy (and no cache coherency issues to boot).

Does your data compress easily? Cassandra supplies built-in data compression, with up to an 80 percent reduction in raw data footprint. Cassandra’s compression also results in no performance penalty, with some read/write operations speeding up because less physical I/O is needed.

Do you want to significantly reduce operational costs by running on simple commodity hardware? Cassandra runs on commodity machines and requires no expensive or special hardware; i.e. no expensive RAID equipment required; just plain old disks on cheap hardware.

On Jul 6, 2012, at 5:52 PM, Scot Hacker <[email protected]> wrote:

> I'm not part of the decision team on this, but here's my personal opinion
> anyway, FWIW.
>
> I believe large efficiencies in developer velocity could be gained if OAE
> assumed a relational db. Going that way could *in theory* pave the way for
> future tools and facilities such as an ORM, a self-generating internal API,
> native data integrity enforcement at every level of the system,
> straightforward cascade deletes, and simplified output of data into other
> contexts (these services could be provided by existing Java tools or be
> built in-house as needed).
>
> Since all of these features assume a relational back-end, I see a move to
> relational as an important first step in the drive to simplify the OAE
> codebase overall. Frankly, I think we're paying a pretty huge price for not
> using a relational schema to begin with. Most of the data in our systems is
> highly relational, and relational data opens the door for tons of tools and
> capabilities we can't get with a non-rel system.
>
> IOTW, the decision is not only about current technical requirements - it's
> also about future tools and services that could come along for the ride.
>
> Just my .02.
>
> __________________________
> Scot Hacker
> Senior Software Developer @ CalCentral
> Educational Technology Services, UC Berkeley
>
> [email protected]
> (510) 292-5586
> __________________________
>
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
