Hello, everyone. Lance, Carl, Chris, and I attended the third annual Apache 
Cassandra summit yesterday in Santa Clara, California. I wanted to go because 
Cassandra is a finalist in our search for a dependable storage technology to 
begin using with OAE. Our criteria include ease of use for deployers, ease of 
use for developers, strength of the community, suitability of the license, 
track record, options for querying, options for scaling, and data integrity 
features (atomicity, isolation, referential integrity, etc.). After eliminating 
a few choices, we're down to evaluating Cassandra and a good old relational 
database over JDBC.

## Some Background
I took my first look at Cassandra about 18 months ago. The way you model your 
data for Cassandra is a little foreign if you come from a relational database 
background. It wasn't obvious how you would write queries for it, and at that 
time I wasn't really paying attention to Cassandra's remarkable cluster 
capabilities. I moved on, thinking of Cassandra as "just another Apache 
project." When we recently started looking for storage options for OAE, I 
wanted to give Cassandra a second look, and Carl wrote a proof of concept using 
Cassandra as the backing store for our connections module (aka contacts). 

On my second look, I really started to appreciate how Cassandra clusters work. 
Cassandra is meant to be run in a cluster from the get-go, rather than 
clustering being a special configuration. The topology is a ring: each node 
holds just a portion of the complete data set, and the number of copies of the 
data kept in the cluster (the replication factor) is configurable. There are 
no master nodes; all nodes are 
peers, and there is no single point of failure. The nodes notify each other 
about the configuration and health of the ring with a protocol called gossip. 
If a node fails, you can replace it with a fresh one and it will be rebuilt 
from the replicas, all without any interruption in service.

Going into the summit, my feeling was that Cassandra was a well-made, 
attractive technology, but that its real downside was being unfamiliar to 
developers and deployers.

## Impressions from the Summit
The Cassandra summit was a one-day event hosted by DataStax, the company that 
provides commercial support and proprietary tools for Cassandra and some of its 
software cousins: Solr and Hadoop. There were sessions in several tracks, and 
the speakers were from companies that are using Cassandra in large-scale 
production to solve various problems in their businesses. Here are some of the 
things I took away from the sessions:

* Cassandra has come a long way since the first version I looked at. In 
particular, its query language, CQL, has now reached feature parity with the 
command-line interface. CQL is the intended (and supported) way of interacting 
with the database from here on out. They have also recently added support for 
richer column types (sets, lists, and maps) and for row-update isolation (a 
sketch of the new collection syntax follows this list).
* Cassandra has no trouble spreading a cluster among multiple data centers; 
replica placement is declared per keyspace (sketched after this list). One 
memorable quote came from a Netflix engineer: "If you care about availability, 
you'll be in multiple data centers." Netflix runs its clusters on Amazon's EC2 
cloud, and when an entire Amazon region went down last month, Netflix's 
service remained uninterrupted. By the way, Cassandra is aware of data centers 
(and of racks within data centers) and never incurs additional network latency 
if it can exchange data locally. I think the option to have a globally 
distributed data store is a real game-changer.
* The data model is different, but it's not _that_ different. Having read some 
documentation and sat in on a couple of data modeling sessions, I'm getting it 
now. The basic idea is that data that should be read together (i.e. the result 
you intend to get back from a query) should be stored together. I could store 
Twitter feeds as one row per user, where the columns in each row are a 
time-ordered list of tweets. The query for that is just "get me the row for 
@aeroplanesoft" (sketched in CQL after this list). There are no joins in 
Cassandra, but the right way to think about this is that it's just like 
denormalization in a relational database. In Cassandra, you denormalize by 
default.
* Some companies, even startups, use a lot of different storage technologies 
at the same time. I saw a talk about exactly this from an engineer at a 
startup called SimpleReach, who calls the approach "polyglot persistence". 
They emphasize picking the right storage technology for each job. The 
tradeoff, of course, is increased complexity. Sakai OAE is already doing this: 
we've got a cocktail of sparsemap, Jackrabbit, and Solr. We're looking for 
just the right recipe.
* Transactional guarantees are possible. Transactions in the relational 
database sense are not supported, but when you really need them, you can get 
the equivalent by implementing a transaction log in a Cassandra row. Matt 
Dennis gave a great talk where he demonstrated this technique with the most 
famous example: debit an amount from one bank account and credit it to 
another. You store account balance deltas in a transaction log, and if you 
encounter a failure anywhere during the transaction, you replay the deltas 
from the log to get your accounts into a provably correct state (a sketch 
follows this list). To quote Matt, "What really matters is that you never lose 
any messages. On read, you can work out the correct values."
* Cassandra goes hand in hand with offline analytics workflows. The typical 
production setup will allocate one part of the Cassandra cluster for real-time 
reads and writes, and another part for batch processing (e.g. Hadoop jobs). 
This workflow isolation ensures that the two types of work don't interfere with 
one another. My favorite talk of the day was by John Akred from Accenture, who 
has done some fascinating work with processing and visualizing data from 
millions of smart electric meters, from offshore drilling platforms, and from 
Toronto's unusual underground power grid. They replaced a ten million dollar 
investment in database hardware and software with $10,000 per month in cloud 
computing. The term "Big Data" refers to two different kinds of Big: big in 
size and big in rate of data throughput. When your storage technology relieves 
the constraints on the "bigness" of your data, you can start thinking about 
posing questions and solving problems that we (so far) have not talked much 
about in Sakai.
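
To make the new collection types concrete, here is a minimal CQL sketch. The 
table and column names are my own illustration, not anything shown at the 
summit:

```
-- Hypothetical user table using CQL collection columns.
CREATE TABLE users (
    username text PRIMARY KEY,
    emails   set<text>,          -- an unordered set of values
    prefs    map<text, text>     -- arbitrary key/value preferences
);

-- Adding to a collection is a single column update.
UPDATE users SET emails = emails + {'zach@example.edu'}
    WHERE username = 'zach';
```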
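
Here is a rough sketch of how multi-data-center replica placement is declared, 
per keyspace. The keyspace and data center names are hypothetical, and the 
data center names have to match the cluster's snitch configuration:

```
-- Keep three replicas of every row in each of two data centers.
CREATE KEYSPACE oae_poc
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    };
```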
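
And here is the one-row-per-user tweet feed in CQL, as I understand the 
pattern; the schema is illustrative:

```
-- All of a user's tweets live in one row (partition), newest first.
CREATE TABLE tweets (
    username  text,
    posted_at timeuuid,
    body      text,
    PRIMARY KEY (username, posted_at)
) WITH CLUSTERING ORDER BY (posted_at DESC);

-- "Get me the row for @aeroplanesoft": one partition, already in time
-- order, no join required.
SELECT posted_at, body FROM tweets
    WHERE username = 'aeroplanesoft'
    LIMIT 50;
```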
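
Finally, a sketch of the transaction log idea as I understood it from Matt's 
talk; the schema and ids are my own illustration, not his code:

```
-- Every transfer writes its balance deltas here first, under a single
-- transfer id generated by the application.
CREATE TABLE transfer_log (
    transfer_id timeuuid,
    account     text,
    delta       decimal,
    PRIMARY KEY (transfer_id, account)
);

-- Debit one account and credit the other under the same transfer id.
-- If anything fails later, a reader can replay these deltas to
-- reconstruct provably correct balances.
INSERT INTO transfer_log (transfer_id, account, delta)
    VALUES (a4a70900-24e1-11df-8924-001ff3591711, 'acct_a', -100.00);
INSERT INTO transfer_log (transfer_id, account, delta)
    VALUES (a4a70900-24e1-11df-8924-001ff3591711, 'acct_b', 100.00);
```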

In short, I think these Cassandra people are on to something. I think this is 
where the information technology industry has been headed ever since Google 
started using thousands of commodity Linux boxes to index the web and published 
their paper on MapReduce. Organizations are going to come to expect failure 
tolerance, high availability, and linear scaling with cheap computers.

## What's Next?
Chris and I are working on reimplementing a slice of our API with support for 
both a relational database and Cassandra. We'll be able to compare the 
development experience and the performance of each, and the server team will 
write up a recommendation from there.

regards,
Zach