Hello,
I'm comparing HBase and Cassandra, which I think are the most promising
distributed key-value stores, to decide which one to use for our future
OLTP and data analysis workloads.
I found the following benchmark report by Yahoo! Research, which evaluates
HBase, Cassandra, PNUTS, and sharded MySQL.
http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
http://www.brianfrankcooper.net/pubs/ycsb.pdf
The above report refers to HBase 0.20.3.
Reading this and HBase's documentation, two questions about load balancing
and replication have arisen. Could anyone give me any information that
would help answer them?
[Q2] replication
Does HBase perform in-memory replication of rows like Cassandra?
Does HBase sync updates to disk before returning success to clients?
According to the following paragraph in HBase design overview, HBase syncs
writes.
----------------------------------------
Write Requests
When a write request is received, it is first written to a write-ahead log
called a HLog. All write requests for every region the region server is
serving are written to the same HLog. Once the request has been written to
the HLog, the result of changes is stored in an in-memory cache called the
Memcache. There is one Memcache for each Store.
----------------------------------------
The source code of the Put class appears to confirm the above (though I
don't understand the server-side code yet):
private boolean writeToWAL = true;
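To check my reading of that paragraph, here is a toy model of the write
ordering it describes: append to the log first, then update the in-memory
cache. This is only a sketch. ToyRegionServer, wal, and memstore are names
I made up, not HBase classes. It also says nothing about whether the log
append is actually flushed to disk, which is exactly what I am asking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical, not part of HBase.
class ToyRegionServer {
    // Stand-in for the single write-ahead log shared by all regions on a server.
    final List<String> wal = new ArrayList<>();
    // Stand-in for the in-memory cache (the overview says there is one per Store).
    final Map<String, String> memstore = new HashMap<>();

    // Apply a write: log it first if requested, then update the in-memory cache.
    void put(String row, String value, boolean writeToWAL) {
        if (writeToWAL) {
            wal.add(row + "=" + value);
        }
        memstore.put(row, value);
    }

    // Simulate a crash: in-memory state is lost, then the log is replayed.
    void crashAndRecover() {
        memstore.clear();
        for (String entry : wal) {
            String[] kv = entry.split("=", 2);
            memstore.put(kv[0], kv[1]);
        }
    }
}
```

In this model, a write made with writeToWAL = true survives
crashAndRecover(), while one made with writeToWAL = false is lost. Whether
the real log gives the same guarantee depends on whether it reaches disk
before the client is acknowledged.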
However, Yahoo's report states the following. Is this incorrect? What is
in-memory replication? As far as I know, HBase relies on HDFS to replicate
data in storage, not in memory.
----------------------------------------
For Cassandra, sharded MySQL and PNUTS, all updates were
synched to disk before returning to the client. HBase does
not sync to disk, but relies on in-memory replication across
multiple servers for durability; this increases write throughput
and reduces latency, but can result in data loss on failure.
----------------------------------------
Maumau