HBase does not do in-memory replication. Your data goes into a region, and each region is served by only one region server at a time. Writes go to the write-ahead log first, which is written to disk on HDFS. However, since HDFS doesn't yet have a fully working flush (sync) functionality, there is a chance of losing the most recent chunk of log data if the region server fails. The next release of HBase will guarantee data durability, since the flush functionality should be fully working by then.
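To make that window for data loss concrete, here is a toy Java model of the write path. It is only a sketch and not HBase code: the class WritePathSketch, its methods, and the syncLog flag are all made up for illustration, and the "disk" is just a list standing in for HDFS. It assumes that edits appended to the log stay in a buffer until an explicit sync, which is the behaviour a missing flush implies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a region server write path. Hypothetical names; not HBase code. */
class WritePathSketch {
    // Log edits that were appended but have not yet reached durable storage.
    private final List<String> walBuffer = new ArrayList<>();
    // Log edits that have reached durable storage (a stand-in for HDFS).
    private final List<String> walOnDisk = new ArrayList<>();
    // In-memory sorted store for the region.
    private final TreeMap<String, String> memstore = new TreeMap<>();

    /** Append the edit to the log first, then apply it to the in-memory store. */
    void put(String row, String value, boolean syncLog) {
        walBuffer.add(row + "=" + value);
        if (syncLog) {
            sync();
        }
        memstore.put(row, value);
    }

    /** Push all buffered log edits to durable storage. */
    void sync() {
        walOnDisk.addAll(walBuffer);
        walBuffer.clear();
    }

    /** Simulate a crash: everything in memory is lost, then the durable log is replayed. */
    TreeMap<String, String> crashAndRecover() {
        walBuffer.clear();
        memstore.clear();
        for (String edit : walOnDisk) {
            String[] kv = edit.split("=", 2);
            memstore.put(kv[0], kv[1]);
        }
        return new TreeMap<>(memstore);
    }

    public static void main(String[] args) {
        WritePathSketch unsynced = new WritePathSketch();
        unsynced.put("row1", "a", false);
        System.out.println(unsynced.crashAndRecover()); // prints {}

        WritePathSketch synced = new WritePathSketch();
        synced.put("row1", "a", true);
        System.out.println(synced.crashAndRecover());   // prints {row1=a}
    }
}
```

In this model the client gets success in both cases, but only the synced edit survives the crash. Nothing here copies the edit into another server's memory; the only protection is what has reached the log on disk.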
Regarding replication: the difference between Cassandra and HBase is that a write in Cassandra doesn't return until it has been written to W nodes, where W is configurable. In HBase, replication is taken care of by the filesystem (HDFS). When a region's in-memory data is flushed to disk, HDFS replicates the resulting HFiles (in which the region's data is stored). For more details on how this works, read the Bigtable paper and http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.

2010/5/8 MauMau <maumau...@gmail.com>

> Hello,
>
> I'm comparing HBase and Cassandra, which I think are the most promising
> distributed key-value stores, to determine which one to choose for the
> future OLTP and data analysis.
> I found the following benchmark report by Yahoo! Research which evaluates
> HBase, Cassandra, PNUTS, and sharded MySQL.
>
> http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
> http://www.brianfrankcooper.net/pubs/ycsb.pdf
>
> The above report refers to HBase 0.20.3.
> Reading this and HBase's documentation, two questions about load balancing
> and replication have arisen. Could anyone give me any information to help
> solve these questions?
>
> [Q2] replication
> Does HBase perform in-memory replication of rows like Cassandra?
> Does HBase sync updates to disk before returning success to clients?
>
> According to the following paragraph in HBase design overview, HBase syncs
> writes.
>
> ----------------------------------------
> Write Requests
> When a write request is received, it is first written to a write-ahead log
> called a HLog. All write requests for every region the region server is
> serving are written to the same HLog. Once the request has been written to
> the HLog, the result of changes is stored in an in-memory cache called the
> Memcache. There is one Memcache for each Store.
> ----------------------------------------
>
> The source code of Put class appears to show the above (though I don't
> understand the server-side code yet):
>
> private boolean writeToWAL = true;
>
> However, Yahoo's report writes as follows. Is this incorrect? What is
> in-memory replication? I know HBase relies on HDFS to replicate data on the
> storage, but not in memory.
>
> ----------------------------------------
> For Cassandra, sharded MySQL and PNUTS, all updates were
> synched to disk before returning to the client. HBase does
> not sync to disk, but relies on in-memory replication across
> multiple servers for durability; this increases write throughput
> and reduces latency, but can result in data loss on failure.
> ----------------------------------------
>
> Maumau