On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon <t...@lipcon.org> wrote: > On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote: >> >> I havn't used either Cassandra or hbase, so please don't take any part of >> this message as me attempting to state facts about either system. However, >> I'm very familiar with data-storage design details, and I've worked >> extensively optimizing applications running on MySQL, Oracle, berkeledb >> (including distributed txn berkeleydb), and Google Bigtable. >> The recent discussion triggered by Facebook messaging using HBase helped >> surface many interesting design differences in the two systems. I'm writing >> this message both to summarize what I've read in a few different places >> about that topic, and to check my facts. >> As far as I can descern, this is a decent summary of the consistency and >> performance differences between hbase and cassandra (N3/R2/W2 or N3/R1/W3) >> for an hbase acceptable workload.. (Please correct the fact if they appear >> wrong!) >> 1) Cassandra can't replicate the consistency situation of HBase. Namely, >> that when a write requiring a quorum fails it will never appear. Deriving >> from this explanation: >> [In Cassandra]Provided at least one node receives the write, it will >> eventually be written to all replicas. A failure to meet the requested >> ConsistencyLevel is just that; not a failure to write the data itself. Once >> the write is received by a node, it will eventually reach all replicas, >> there is no roll back. - Nick Telford [ref] >> >> [In Hbase] The DFSClient call returns when all datanodes in the pipeline >> have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in >> the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning >> that it's in 0.21.0 - Jean-Daniel Cryans [ref] >> in HBase, if a write is accepted by only 1 of 3 HDFS replicas; and the >> region master never receives a response from the other two replicas; and it >> fails the client write, that write should never appear. Even if the region >> master then fails, when a new region master is elected, and it restarts and >> recovers, it should read HDFS blocks and accept the consensus 2/3 opinion >> that the log does not contain the write -- dropping the write. The write >> will never be seen. > > Not quite. The replica synchronization code is pretty messy, but basically > it will take the longest replica that may have been synced, not a quorum. > i.e the guarantee is that "if you successfully sync() data, it will be > present after replica synchronization". Unsynced data *may* be present after > replica synchronization. > But keep in mind that recovery is blocking in most cases - ie if the RS is > writing to a pipeline and waiting on acks, and one of the nodes in the > pipeline dies, then it will recover the pipeline (without the dead node) and > continue syncing to the remaining two nodes. The client is still blocked at > this point. > If the RS itself dies, then it won't respond to the client at all, and it's > anyone's guess whether the write was successful or not. The same is true if > the network between client and RS dies. This is unavoidable in any system - > a server can always fail *just before* sending the "success" message, and > the write is left in "maybe written" state. > What will *not* happen, though, is the following case: > - Row contains value A > - Client writes value B > - RS fails > - Client reads value A > - Client reads again and sees value B > Similarly, if client reads value B, it won't revert to value A in any > circumstance. > >> >> In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only >> one node, that write will fail to the client. Future reads R=1 will see that >> write or not depending on whether they contact the one server that accepted >> or not, until the data is propagated, at which time they will see the write. >> Reads R=2 will not see the write until it is propagated until at least two >> servers. There is no mechanism to assure that a write is either accepted by >> the requested number of servers or aborted. >> 2) Cassandra has a less efficient memory footprint data pinned in memory >> (or cached). With 3 replicas on Cassandra, each element of data pinned >> in-memory is kept in memory on 3 servers, wheras in hbase only region >> masters keep the data in memory, so there is only one-copy of each data >> element. >> 3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory >> data. HBase can answer a read-only query that is in memory from the single >> region-master, while Cassandra (N3/W2/R2) must read from multiple servers. >> (note, N3/W2/R2 still doesn't produce the same consistency situation as >> hbase, see #1) > > Yes, probably - except that it seems to me Cassandra should be able to offer > lower latency in the face of java GC pauses. If an HBase RS is in a 200ms GC > pause, latency for all rows hosted by that server will spike to 200ms. If > one of three replicas is in a 200ms GC pause, the other two replicas will > still respond quickly so latency should be less spiky in Cassandra. But it's > at the cost of more RAM usage as you mentioned above. > >> >> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable >> again in the face of a node-failure than HBase/HDFS. Cassandra must repair >> the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS >> can still acheive a 2 node quorum in the face of a node failure. (note, >> using N3/W2 requires R2, see #3) (note, this still doesn't produce the same >> consistency situation as hbase, see #1.) > > It takes HBase a while to detect the failure and recover the region to a new > server - the recovery time depends on the amount of unflushed data in the > memstores of the failed server. With default configs, it takes 1 minute for > the ZK lease to be lost on a failed server, and then somewhere between 10 > seconds and a few minutes to fully reassign the regions (depending on amount > of data needing to be replayed from the HLog) > >> >> 5) HBase can't match the row-availability situation of Cassandra(N3/W2/R2 >> or N3/W3/R1). In the face of a single machine failure, if it is a region >> master, those keys are offline in HBase until a new region master is elected >> and brought online. In Cassandra, no single node failure causes the data to >> become unavailable. > > True. We're working on the concept of "slave regions" which would handle > some kinds of reads and blind puts during failures. > >> >> Is that summary correct? Am I missing any points? Did I get any facts >> wrong? >> Note, I'm NOT attempting to advocate the following changes, but simply >> understand the design differences.... >> From my uninformed view, it seems that #1 causes the biggest cascade of >> differences, affecting both #3 and #4. If Cassandra were allowed to do what >> HBase/HDFS does, namely to specify a repair-consistency requirement, then >> Cassandra (N3/W2/R2/Repair2) should be the same consistency guarantee as >> HBase. Further, if Cassandra were allowed to elect one of the copies of data >> as 'master'. then it could require the master participate in all quorum >> writes, allowing reads to be consistent when conducted only through the >> master. This could be a road to address 2/3/4. To my eyes, this would make >> Cassandra capable of operating in a mode which is essentially equivilent to >> HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not advocating >> Cassandra make these changes, I'm simply trying to understand the >> differences, by considering what changes it would take to make them the >> same. > > On the surface it seems reasonable, but these things are always very tricky, > and it's the subtleties that will kill you :) > -Todd > P.S. Very happy to see informed technical discussion of the differences in > the two architectures!
On Mon, Nov 22, 2010 at 1:06 PM, David Jeske <dav...@gmail.com> wrote: > > I already noticed a mistake in my own facts... > On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote: >> >> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable >> again in the face of a node-failure than HBase/HDFS. Cassandra must repair >> the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS >> can still acheive a 2 node quorum in the face of a node failure. (note, >> using N3/W2 requires R2, see #3) (note, this still doesn't produce the same >> consistency situation as hbase, see #1.) > > This should read: > 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again > in the face of a node-failure than HBase/HDFS in the face of an HDFS node > failure. Cassandra must repair the keyrange to bring N from 2 to 3 to resume > allowing writes with W=3. HDFS can still acheive a 2 node quorum in the face > of a node failure. (note, using N3/W2 requires R2, see #3) (note, this still > doesn't produce the same consistency situation as hbase, see #1.) 2) Cassandra has a less efficient memory footprint data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in-memory is kept in memory on 3 servers, wheras in hbase only region masters keep the data in memory, so there is only one-copy of each data element. True in the general case, but if you want to optimize for READ.ONE. We can make our caches work like hbase. https://issues.apache.org/jira/browse/CASSANDRA-1314. #4 Not a fair comparison. If hbase is happy working with 2/3 replicas it is not equal to Write.ALL Read.ONE. On Mon, Nov 22, 2010 at 1:06 PM, David Jeske <dav...@gmail.com> wrote: > > I already noticed a mistake in my own facts... > On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote: >> >> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable >> again in the face of a node-failure than HBase/HDFS. Cassandra must repair >> the keyrange to bring N from 2 to 3 to resume allowing writes with W=3. HDFS >> can still acheive a 2 node quorum in the face of a node failure. (note, >> using N3/W2 requires R2, see #3) (note, this still doesn't produce the same >> consistency situation as hbase, see #1.) > > This should read: > 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again > in the face of a node-failure than HBase/HDFS in the face of an HDFS node > failure. Cassandra must repair the keyrange to bring N from 2 to 3 to resume > allowing writes with W=3. HDFS can still acheive a 2 node quorum in the face > of a node failure. (note, using N3/W2 requires R2, see #3) (note, this still > doesn't produce the same consistency situation as hbase, see #1.) 2) Cassandra has a less efficient memory footprint data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in-memory is kept in memory on 3 servers, wheras in hbase only region masters keep the data in memory, so there is only one-copy of each data element. True in the general case, but if you want to optimize for READ.ONE. We can make our caches work like hbase. https://issues.apache.org/jira/browse/CASSANDRA-1314.