I havn't used either Cassandra or hbase, so please don't take any part of this message as me attempting to state facts about either system. However, I'm very familiar with data-storage design details, and I've worked extensively optimizing applications running on MySQL, Oracle, berkeledb (including distributed txn berkeleydb), and Google Bigtable.
The recent discussion triggered by Facebook messaging using HBase helped surface many interesting design differences in the two systems. I'm writing this message both to summarize what I've read in a few different places about that topic, and to check my facts. As far as I can descern, this is a decent summary of the consistency and performance differences between hbase and cassandra (N3/R2/W2 or N3/R1/W3) for an hbase acceptable workload.. (Please correct the fact if they appear wrong!) *1) Cassandra can't replicate the consistency situation of HBase.* Namely, that when a write requiring a quorum fails it will never appear. Deriving from this explanation: [In Cassandra]Provided at least one node receives the write, it will eventually be written to all replicas. A failure to meet the requested ConsistencyLevel is just that; not a failure to write the data itself. Once the write is received by a node, it will eventually reach all replicas, there is no roll back. - Nick Telford [ref<http://www.mail-archive.com/user@cassandra.apache.org/msg07398.html> ] [In Hbase] The DFSClient call returns when all datanodes in the pipeline have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning that it's in 0.21.0 - Jean-Daniel Cryans [ref<http://www.quora.com/How-does-HBase-write-performance-differ-from-write-performance-in-Cassandra-with-consistency-level-ALL?q=hbase+cassandra> ] in HBase, if a write is accepted by only 1 of 3 HDFS replicas; and the region master never receives a response from the other two replicas; and it fails the client write, that write should never appear. Even if the region master then fails, when a new region master is elected, and it restarts and recovers, it should read HDFS blocks and accept the consensus 2/3 opinion that the log does not contain the write -- dropping the write. The write will never be seen. In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one node, that write will fail to the client. Future reads R=1 will see that write or not depending on whether they contact the one server that accepted or not, until the data is propagated, at which time they will see the write. Reads R=2 will not see the write until it is propagated until at least two servers. There is no mechanism to assure that a write is either accepted by the requested number of servers or aborted. *2) Cassandra has a less efficient memory footprint data pinned in memory (or cached).* With 3 replicas on Cassandra, each element of data pinned in-memory is kept in memory on 3 servers, wheras in hbase only region masters keep the data in memory, so there is only one-copy of each data element. *3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory data.* HBase can answer a read-only query that is in memory from the single region-master, while Cassandra (N3/W2/R2) must read from multiple servers. (note, N3/W2/R2 still doesn't produce the same consistency situation as hbase, see #1) *4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again in the face of a node-failure than HBase/HDFS.* Cassandra must repair the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS can still acheive a 2 node quorum in the face of a node failure. (note, using N3/W2 requires R2, see #3) (note, this still doesn't produce the same consistency situation as hbase, see #1.) *5) HBase can't match the row-availability situation of Cassandra(N3/W2/R2 or N3/W3/R1).* In the face of a single machine failure, if it is a region master, those keys are offline in HBase until a new region master is elected and brought online. In Cassandra, no single node failure causes the data to become unavailable. Is that summary correct? Am I missing any points? Did I get any facts wrong? Note, I'm NOT attempting to advocate the following changes, but simply understand the design differences.... >From my uninformed view, it seems that #1 causes the biggest cascade of differences, affecting both #3 and #4. If Cassandra were allowed to do what HBase/HDFS does, namely to specify a repair-consistency requirement, then Cassandra (N3/W2/R2/Repair2) should be the same consistency guarantee as HBase. Further, if Cassandra were allowed to elect one of the copies of data as 'master'. then it could require the master participate in all quorum writes, allowing reads to be consistent when conducted only through the master. This could be a road to address 2/3/4. To my eyes, this would make Cassandra capable of operating in a mode which is essentially equivilent to HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not advocating Cassandra make these changes, I'm simply trying to understand the differences, by considering what changes it would take to make them the same. Thanks for all your great work on Cassandra!