Okay, here are some corrections.  It's a bit choppy because it's
just that: a list of corrections.

Again, this is just trying to address factual errors; I disagree with
many of the expressed opinions, too. :)

> Cassandra relies mostly on Key-Value pairs for storage

No more than HBase does.  Cassandra's columnfamily model does away
with historical values and adds supercolumns, but the two have a lot
more in common with each other than with actual k/v stores.
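
To make the distinction concrete, here's a rough sketch of the data
models (illustrative Python only, not Cassandra's actual API):

    # Illustrative sketch -- not Cassandra's real API.
    # Plain k/v store: one opaque value per key.
    kv_store = {"row1": b"opaque blob"}

    # Columnfamily: row key -> {column name: value}, with columns
    # kept sorted by name.  No per-value history, unlike HBase.
    columnfamily = {"row1": {"age": b"36", "name": b"alice"}}

    # Supercolumn family: one extra level of nesting.
    supercolumns = {"row1": {"address": {"city": b"Austin",
                                         "zip": b"78704"}}}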

> it’s a fact that far more people are using HBase than Cassandra at this moment

While it's possible that more people are using HBase right now, with
90 people in the Cassandra IRC channel and 55 in HBase's, I'm
comfortable that Cassandra's community is healthy.

> despite both being similarly recent

HBase is roughly 2x as old as Cassandra.

> HBase values strong consistency and High Availability while Cassandra values 
> Availability and Partitioning tolerance

HBase actually picks CP.

> Efficiently running MapReduce on Cassandra, on the other hand, is difficult 
> because all of its keys are in one big “space”, so the MapReduce framework 
> doesn’t know how to split and divide the data natively. There needs to be 
> some hackery in place to handle all of that.

Writing a Hadoop input generator is a Feature, to use the article's
terminology.  It doesn't have to be hackish; in fact, trunk now has a
key range splitter that could easily be adapted to Hadoop.
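
To give an idea of what such a splitter does (a hand-wavy sketch on
my part, not the actual trunk code): carve the partitioner's token
space into contiguous ranges, one independent input split per mapper:

    # Hand-wavy sketch -- not the actual trunk splitter.
    # The random partitioner hashes keys into a fixed token space
    # (md5-based, so roughly 0..2**127); contiguous slices of that
    # space make natural Hadoop input splits.
    TOKEN_SPACE = 2 ** 127

    def key_range_splits(num_splits):
        """Yield (start_token, end_token) ranges covering the ring."""
        step = TOKEN_SPACE // num_splits
        for i in range(num_splits):
            start = i * step
            end = TOKEN_SPACE if i == num_splits - 1 else start + step
            yield (start, end)

    for split in key_range_splits(4):
        print(split)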

Quoting an old patchset to "prove" that Cassandra can only interface
poorly with Hadoop is weak.

> Cassandra is only a Ruby gem install away.

Or a tar download, or a deb package...

> You still have to do quite a bit of manual configuration

Other than columnfamily definition (which must also be done for
HBase), I'm not sure what the author was thinking of here.
bin/cassandra works out of the box, and (unlike HBase) there is only
one type of process to deal with, which is a huge win for ops in
production.

> in HBase, if a region server is down, writes will be blocked for affected 
> data until the data is redistributed

(that is why HBase really has CP out of CAP, not CA)

> Cassandra, however, has an internal method of resolving up-to-dateness issues 
> with vector clocks — a complex but workable solution where basically the 
> latest timestamp wins

No; Cassandra uses latest-timestamp-wins, which is totally different
from vector clocks.
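
The difference in a nutshell (an illustrative sketch, not Cassandra
internals): timestamp resolution just keeps whichever version carries
the highest timestamp, with no version vectors to merge:

    # Illustrative sketch -- not Cassandra internals.
    def resolve(versions):
        """Last-write-wins: keep the (timestamp, value) pair with
        the highest timestamp.  No vector-clock merging involved."""
        return max(versions, key=lambda v: v[0])

    replicas = [(1000, b"old"), (1005, b"new"), (1002, b"older")]
    print(resolve(replicas))  # -> (1005, b'new')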

> Another architectural quibble is that Cassandra only supports one table per 
> install. That means you can’t denormalize your data to make it more usable in 
> analytical scenarios.

Not even a kernel of truth there.  wtf?

> Cassandra is really more of a Key Value store than a Data Warehouse.

Again: wtf?

> Furthermore, schema changes require a cluster restart

This part is true, for now.  But it's misleading, since "schema
change" here means "adding CFs or keyspaces," not merely "modifying
columns" as in traditional dbs.

> it’s difficult to claim that Cassandra implements the BigTable model

We never claimed to be a pure BigTable clone.  We don't want to be,
because of the single points of failure and operational complexity
involved.

> Cassandra is optimized for small datacenters (hundreds of nodes) connected by 
> very fast fiber. HBase, being based on research originally published by 
> Google, is happy to handle replication to thousands of planet-strewn nodes 
> across the ’slow’, unpredictable Internet

Cassandra has multi-datacenter support already.  HBase didn't, last I
checked.  So this is weird.

> This first diagram is a model of the Cassandra replication scheme.

Note that all these steps happen in parallel.

> it’s impossible to tell when the required number of replicas will be 
> up-to-date. This can be extremely painful in a live situation — when one of 
> your DCs goes down, you often want to know *exactly* when to expect data 
> consistency

Cassandra provides consistency when R + W > N (read replica count +
write replica count > replication factor).  If you do both writes and
reads at QUORUM, for example, you can expect data consistency as soon
as enough nodes are up for a quorum (which may not even require the
downed DC to be online).  That is not "impossible to tell" at all.
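
Concretely, with replication factor N = 3, QUORUM means 2 nodes for
both reads and writes, so every read overlaps the latest successful
write on at least one replica (this is just the arithmetic spelled
out):

    # R + W > N guarantees read and write sets overlap.
    def quorum(n):
        return n // 2 + 1

    N = 3              # replication factor
    R = W = quorum(N)  # QUORUM reads and writes -> 2 each
    assert R + W > N   # 2 + 2 > 3: overlap, hence consistency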

> It’s important to note that Cassandra relies on high-speed fiber between 
> datacenters.

Simply flat-out wrong.

> If your writes are taking 1 or 2 ms, that’s fine. But when a DC goes out and 
> you have to revert to a secondary one in China instead of 20 miles away, the 
> incredible latency will lead to write timeouts and highly inconsistent data.

Sure, "incredible" latency of 100ms or so is bad, but it's not the end
of the world, and won't cause either write timeouts or inconsistent
data, assuming that you are in fact using R + W > N.
