First, let me say that I love SQL, and it has treated me well for many years.
The problem I see is that SQL is only a partial solution. Can we eliminate
Solr? What about NFS? What will we do about analytics? As far as I can tell,
Cassandra is the only technology that lets us solve a very complex set of
divergent problems with a single solution. So I think the discussion we are
having is less about SQL vs. NoSQL than about complexity vs. simplicity.

FYI - I have done some more research into when real-world implementers choose
Cassandra over SQL. I will try to capture the essence of the talking points
below and provide some local context where possible:

If you are not using big data to your advantage, your competitor is!
Blackboard’s xPlor announcement last week (i.e. MongoDB + ElasticSearch + LTI
Provider) should be a very clear shot across our bow. To say this very clearly
again: we cannot be chasing last year’s LMS. If we are not evolving, we are
dying, and we must make a strategic decision for the future of our
application. We simply cannot let the industry outpace us tomorrow by making a
poor selection today.

High-velocity writes: do you have a large amount of data arriving at your
application in a short amount of time? I imagine our activity stream
requirements fall into this category, and I am certain others will as well.
Once we have this capability, imagine the kinds of uses we might find for it:
learning analytics, click stream analysis, log file storage and analysis, etc.
Being able to act on big data / fast data is a key differentiator in today’s
marketplace and something we should factor into our decision-making process.

Large volume of data, i.e. terabytes or petabytes: I would imagine a large or
multi-tenant implementation easily reaching these kinds of numbers within a
year or so (especially as storage needs are increasing dramatically).

A mixture of data types, i.e. structured, semi-structured, and unstructured:
we have all three! OAE falls into this case with its mixture of user-generated
content, activity streams, and more traditional data. If your data has some
kind of natural ordering, for example time-based ordering, Cassandra provides
a very compelling solution, as it has built-in support for returning ranges of
data in that order (e.g. activity streams, inboxes, etc.).

Deployment complexity: do you need your data to be distributed across
geographic zones and/or regions to support continuous availability or disaster
recovery scenarios? Most institutions will share the common DR concern, and
some will also be concerned about continuous availability, i.e. what happens
when an availability zone goes down (e.g. Amazon East). Cassandra already has
built-in support for placing replicas across racks and data centers. Cassandra
also has no single point of failure - all nodes are equal.

Do you need true ACID properties? Cassandra provides “AID”, and consistency
(in the CAP theorem sense) is tunable per read or write operation.

Manually sharding relational databases is very HARD and takes a TON of
investment to get right. Expanding the capacity of a live relational database
while it stays online is also very difficult. These are real operational
concerns, and there are many examples where this kind of growing pain is not
easy to overcome and can break a project.

No need for a complex memcached cluster, or for worrying about a cache getting
out of sync. FYI - memcached is usually one of the first solutions employed to
address scaling issues with a relational DB. Cassandra natively provides
memcached-like behavior straight out of the box with its caching strategy (and
no cache coherency issues to boot).

Does your data compress easily? Cassandra supplies built-in data compression,
with a reported reduction of up to 80 percent in raw data footprint.
Cassandra’s compression is also reported to carry no performance penalty, with
some read/write operations speeding up because less physical I/O is needed.

Do you want to significantly reduce operational costs by running on simple
commodity hardware? Cassandra runs on commodity machines and requires no
expensive or special hardware, i.e. no expensive RAID equipment; just plain
old disks on cheap hardware.
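
To make the "tunable per read or write operation" point above a bit more
concrete, here is a rough sketch of the usual quorum arithmetic: a read is
guaranteed to overlap the latest acknowledged write when the replicas read (R)
plus the replicas written (W) exceed the replication factor (RF). This is only
a toy model in Python, not Cassandra client code. The function names are made
up for illustration; ONE, QUORUM and ALL are Cassandra's consistency level
names.

```python
# Toy model of tunable consistency. It does not talk to a cluster.
# replicas_required and read_sees_latest_write are hypothetical helper
# names invented for this sketch; they are not part of any Cassandra API.


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at a level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for w_level, r_level in [("ONE", "ONE"),
                             ("QUORUM", "QUORUM"),
                             ("ONE", "ALL")]:
        print(w_level, r_level, read_sees_latest_write(w_level, r_level, rf))
```

With RF = 3, writing and reading at QUORUM (2 + 2 > 3) gives overlapping
replica sets, while ONE/ONE (1 + 1) does not, which is the trade-off between
latency and consistency that each operation gets to pick.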


On Jul 6, 2012, at 5:52 PM, Scot Hacker <[email protected]> wrote:

> I'm not part of the decision team on this, but here's my personal opinion 
> anyway, FWIW.
> 
> I believe large efficiencies in developer velocity could be gained if OAE  
> assumed a relational db. Going that way could *in theory* pave the way for 
> future tools and facilities such as an ORM, a self-generating internal API, 
> native data integrity enforcement at every level of the system, 
> straightforward cascade deletes, and simplified output of data into other 
> contexts (these services could be provided by existing Java tools or  be 
> built in-house as needed).
> 
> Since all of these features assume a relational back-end, I see a move to 
> relational as an important first step in the drive to simplify the OAE 
> codebase overall. Frankly, I think we're paying a pretty huge price for not 
> using a relational schema to begin with. Most of the data in our systems is 
> highly relational, and relational data opens the door for tons of tools and 
> capabilities we can't get with a non-rel system.
> 
> In other words, the decision is not only about current technical
> requirements - it's also about future tools and services that could come
> along for the ride.
> 
> Just my .02.
> 
> __________________________
> Scot Hacker
> Senior Software Developer @ CalCentral
> Educational Technology Services, UC Berkeley
> 
> [email protected]
> (510) 292-5586
> __________________________
> 
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev

