I haven't used either Cassandra or HBase, so please don't take any part of
this message as me attempting to state facts about either system. However,
I'm very familiar with data-storage design details, and I've worked
extensively optimizing applications running on MySQL, Oracle, BerkeleyDB
(including distributed-transaction BerkeleyDB), and Google Bigtable.

The recent discussion triggered by Facebook messaging using HBase helped
surface many interesting design differences between the two systems. I'm
writing this message both to summarize what I've read in a few different
places about that topic, and to check my facts.

As far as I can discern, this is a decent summary of the consistency and
performance differences between HBase and Cassandra (N3/R2/W2 or N3/R1/W3)
for an HBase-acceptable workload. (Please correct the facts if they appear
wrong!)

*1) Cassandra can't replicate the consistency situation of HBase.* Namely,
that when a write requiring a quorum fails, it will never appear. I'm
deriving this from the following explanations:

[In Cassandra] Provided at least one node receives the write, it will
eventually be written to all replicas. A failure to meet the requested
ConsistencyLevel is just that; not a failure to write the data itself. Once
the write is received by a node, it will eventually reach all replicas,
there is no roll back. - Nick Telford
[ref<http://www.mail-archive.com/user@cassandra.apache.org/msg07398.html>
]

[In Hbase] The DFSClient call returns when all datanodes in the pipeline
have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in
the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning
that it's in 0.21.0 - Jean-Daniel Cryans
[ref<http://www.quora.com/How-does-HBase-write-performance-differ-from-write-performance-in-Cassandra-with-consistency-level-ALL?q=hbase+cassandra>
]

In HBase, if a write is accepted by only 1 of 3 HDFS replicas, the region
master never receives a response from the other two replicas, and it fails
the client write, then that write should never appear. Even if the region
master then fails, when a new region master is elected, restarts, and
recovers, it should read the HDFS blocks and accept the consensus 2/3
opinion that the log does not contain the write, dropping the write. The
write will never be seen.

In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one
node, that write is reported to the client as a failure. Until the data
propagates, future reads with R=1 will see the write or not, depending on
whether they happen to contact the one server that accepted it; once it has
propagated, they will see it. Reads with R=2 will not see the write until it
has propagated to at least two servers. There is no mechanism to ensure that
a write is either accepted by the requested number of servers or aborted.
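
To make #1 concrete, here is a toy model of the two failure semantics as I
understand them. This is not real Cassandra or HBase code; all names and
numbers are illustrative assumptions on my part.

    def cassandra_style_write(replicas, value, acks_received, w):
        # Replicas that received the write keep it, even though the client
        # is told the write failed; read repair / hinted handoff will
        # eventually spread it to the remaining replicas.
        for i in range(acks_received):
            replicas[i] = value
        return acks_received >= w

    def hbase_style_write(log_replicas, value, acks_received, w):
        # The write is durable only if enough log replicas ack; otherwise
        # recovery takes the majority opinion and the write never appears.
        if acks_received >= w:
            for i in range(len(log_replicas)):
                log_replicas[i] = value
            return True
        return False

    cass = [None, None, None]
    print(cassandra_style_write(cass, "v1", acks_received=1, w=2), cass)
    # -> False ['v1', None, None]   (a later R=1 read may still return v1)

    hdfs_log = [None, None, None]
    print(hbase_style_write(hdfs_log, "v1", acks_received=1, w=2), hdfs_log)
    # -> False [None, None, None]   (the write can never be seen)

In the first case a later read that happens to hit the one accepting node can
return the "failed" write; in the second case the minority copy is discarded
on recovery, so the write can never be observed.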

*2) Cassandra has a less efficient memory footprint for data pinned in memory
(or cached).* With 3 replicas on Cassandra, each element of data pinned
in memory is kept in memory on 3 servers, whereas in HBase only the region
master serving a key range keeps the data in memory, so there is only one
copy of each data element.
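
As a back-of-envelope illustration of #2 (the numbers are made up, not
benchmarks):

    hot_data_gb = 100                    # data you want served from memory
    replication_factor = 3               # Cassandra N=3

    cassandra_ram_gb = hot_data_gb * replication_factor  # every replica caches its copy
    hbase_ram_gb = hot_data_gb                            # only the serving region master caches

    print(cassandra_ram_gb, hbase_ram_gb)   # 300 vs 100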

*3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory
data.* HBase can answer an in-memory read-only query from the single region
master, while Cassandra (N3/W2/R2) must read from multiple servers.
(Note: N3/W2/R2 still doesn't produce the same consistency situation as
HBase, see #1.)
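
A toy latency model for #3, assuming (purely for illustration) that every
server answers an in-memory read in roughly 1 ms, and that a quorum read
must wait for responses from R replicas while a master read waits for one:

    import random

    def server_latency_ms():
        return random.gauss(1.0, 0.3)      # hypothetical in-memory read latency

    def quorum_read_latency(r=2):
        # the coordinator must wait for the slowest of the r responses it needs
        return max(server_latency_ms() for _ in range(r))

    def master_read_latency():
        return server_latency_ms()

    trials = 10000
    print(sum(quorum_read_latency() for _ in range(trials)) / trials)
    print(sum(master_read_latency() for _ in range(trials)) / trials)

The quorum read is dominated by the slower of the responses it needs, so its
average is higher even when every individual server is fast.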

*4) Cassandra (N3/W3/R1) takes longer than HBase/HDFS to allow data to become
writable again in the face of a node failure.* Cassandra must repair the
key range to bring N from 2 back to 3 before it can resume accepting writes
with W=3, whereas HDFS can still achieve a 2-node quorum in the face of a
node failure. (Note: using N3/W2 requires R2, see #3. Note also that this
still doesn't produce the same consistency situation as HBase, see #1.)
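
A minimal check for #4 (again just a sketch of the rule, not real code):

    def writes_accepted(live_replicas, required_acks):
        return live_replicas >= required_acks

    # Cassandra N3/W3 with one node down: blocked until the range is repaired.
    print(writes_accepted(live_replicas=2, required_acks=3))   # False

    # HDFS-style pipeline that can still ack with 2 of 3 datanodes.
    print(writes_accepted(live_replicas=2, required_acks=2))   # True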

*5) HBase can't match the row-availability situation of Cassandra (N3/W2/R2
or N3/W3/R1).* In the face of a single machine failure, if it is a region
master, those keys are offline in HBase until a new region master is elected
and brought online. In Cassandra, no single node failure causes the data to
become unavailable.
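
And the mirror-image sketch for #5, under the same toy assumptions:

    def hbase_reads_available(region_master_up, failover_complete):
        # keys served by a failed region master are offline until a new one is online
        return region_master_up or failover_complete

    def cassandra_reads_available(live_replicas, r=2):
        # any r live replicas can answer the read
        return live_replicas >= r

    print(hbase_reads_available(region_master_up=False, failover_complete=False))  # False during failover
    print(cassandra_reads_available(live_replicas=2))                              # True with one node down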


Is that summary correct? Am I missing any points? Did I get any facts wrong?

Note, I'm NOT attempting to advocate the following changes, but simply
understand the design differences....

From my uninformed view, it seems that #1 causes the biggest cascade of
differences, affecting both #3 and #4. If Cassandra were allowed to do what
HBase/HDFS does, namely to specify a repair-consistency requirement, then
Cassandra (N3/W2/R2/Repair2) should provide the same consistency guarantee as
HBase. Further, if Cassandra were allowed to elect one of the copies of the
data as 'master', then it could require the master to participate in all
quorum writes, allowing reads to be consistent when conducted only through
the master. This could be a road to addressing #2/#3/#4. To my eyes, this
would make Cassandra capable of operating in a mode that is essentially
equivalent to HBase/HDFS. Do those facts seem correct? AGAIN, I'm not
advocating that Cassandra make these changes; I'm simply trying to understand
the differences by considering what changes it would take to make them the
same.
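
Here is a sketch of that hypothetical mode, just to make the idea precise. To
be clear, this is not an existing Cassandra feature, and all the names are
invented:

    # Hypothetical 'master-participating quorum' mode (NOT a Cassandra feature).
    def hypothetical_write_ok(acking_nodes, master, w=2):
        # a quorum write succeeds only if the designated master is among the acks
        return len(acking_nodes) >= w and master in acking_nodes

    def hypothetical_consistent_read(replica_values, master):
        # since the master saw every committed write, reading it alone (R=1) is consistent
        return replica_values[master]

    print(hypothetical_write_ok({"node_a", "node_b"}, master="node_a"))   # True
    print(hypothetical_write_ok({"node_b", "node_c"}, master="node_a"))   # False: retry or fail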

Thanks for all your great work on Cassandra!
