On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon <t...@lipcon.org> wrote:
> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:
>>
>> I haven't used either Cassandra or HBase, so please don't take any part of
>> this message as an attempt to state facts about either system. However,
>> I'm very familiar with data-storage design details, and I've worked
>> extensively on optimizing applications running on MySQL, Oracle, BerkeleyDB
>> (including distributed-transaction BerkeleyDB), and Google Bigtable.
>> The recent discussion triggered by Facebook messaging using HBase helped
>> surface many interesting design differences in the two systems. I'm writing
>> this message both to summarize what I've read in a few different places
>> about that topic, and to check my facts.
>> As far as I can discern, this is a decent summary of the consistency and
>> performance differences between HBase and Cassandra (N3/W2/R2 or N3/W3/R1)
>> for an HBase-acceptable workload. (Please correct the facts if they appear
>> wrong!)
>> 1) Cassandra can't replicate the consistency situation of HBase: namely,
>> that when a write requiring a quorum fails, it will never appear. This
>> derives from the following explanations:
>> [In Cassandra] Provided at least one node receives the write, it will
>> eventually be written to all replicas. A failure to meet the requested
>> ConsistencyLevel is just that; not a failure to write the data itself. Once
>> the write is received by a node, it will eventually reach all replicas,
>> there is no roll back. - Nick Telford [ref]
>>
>> [In Hbase] The DFSClient call returns when all datanodes in the pipeline
>> have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in
>> the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning
>> that it's in 0.21.0 - Jean-Daniel Cryans [ref]
>> In HBase, if a write is accepted by only 1 of 3 HDFS replicas, the region
>> server never receives a response from the other two replicas, and it fails
>> the client write, then that write should never appear. Even if the region
>> server then fails, when the region is reassigned and the new server restarts
>> and recovers, it should read the HDFS blocks and accept the consensus 2/3
>> opinion that the log does not contain the write -- dropping the write. The
>> write will never be seen.
>
> Not quite. The replica synchronization code is pretty messy, but basically
> it will take the longest replica that may have been synced, not a quorum.
> I.e., the guarantee is that "if you successfully sync() data, it will be
> present after replica synchronization". Unsynced data *may* be present after
> replica synchronization.
> But keep in mind that recovery is blocking in most cases - i.e., if the RS
> is writing to a pipeline and waiting on acks, and one of the nodes in the
> pipeline dies, then it will recover the pipeline (without the dead node) and
> continue syncing to the remaining two nodes. The client is still blocked at
> this point.
> If the RS itself dies, then it won't respond to the client at all, and it's
> anyone's guess whether the write was successful or not. The same is true if
> the network between client and RS dies. This is unavoidable in any system -
> a server can always fail *just before* sending the "success" message, and
> the write is left in "maybe written" state.
> What will *not* happen, though, is the following case:
> - Row contains value A
> - Client writes value B
> - RS fails
> - Client reads value A
> - Client reads again and sees value B
> Similarly, if client reads value B, it won't revert to value A in any
> circumstance.
>
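To make the replica-synchronization rule above concrete, here is a minimal
sketch (plain Python, not HDFS code; the replica lengths are made-up example
values): recovery keeps the longest candidate replica rather than a majority
length, so synced bytes always survive and un-acked bytes sometimes do.

    def recover_block(replica_lengths):
        # Longest replica wins: any byte that was successfully sync()'d exists
        # on at least one replica, so it survives recovery. Bytes that were
        # never acked *may* also survive if they reached that same replica.
        return max(replica_lengths)

    # Example: the writer got an ack from only 1 of 3 datanodes before dying.
    print(recover_block([4096, 1024, 1024]))   # -> 4096: the un-acked tail is kept
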
>>
>> In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only
>> one node, that write is reported to the client as a failure. Until the data
>> propagates, future R=1 reads will see the write or not depending on whether
>> they happen to contact the one server that accepted it; once it has
>> propagated, they will see it. R=2 reads will not see the write until it has
>> propagated to at least two servers. There is no mechanism to ensure that a
>> write is either accepted by the requested number of servers or aborted.
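Here is a minimal sketch of that no-rollback behavior (plain Python, not
Cassandra code; the node names and in-memory dict are illustrative only): a
write that misses its requested ConsistencyLevel is reported as a failure, yet
the replica that did accept it keeps the value and serves it to R=1 readers.

    class UnavailableError(Exception):
        pass

    replicas = {"node1": None, "node2": None, "node3": None}

    def write(value, alive, w=2):
        acks = 0
        for node in replicas:
            if node in alive:
                replicas[node] = value        # the mutation sticks on this replica
                acks += 1
        if acks < w:
            # The client sees an error, but nothing is rolled back.
            raise UnavailableError("only %d of W=%d replicas acked" % (acks, w))

    try:
        write("B", alive={"node1"}, w=2)      # only one replica reachable
    except UnavailableError as e:
        print("client sees a failed write:", e)

    print(replicas["node1"])  # -> B: the 'failed' write is visible to an R=1 read here
    print(replicas["node2"])  # -> None until hinted handoff / read repair propagates it
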
>> 2) Cassandra has a less efficient memory footprint for data pinned in
>> memory (or cached). With 3 replicas in Cassandra, each element of data
>> pinned in memory is kept in memory on 3 servers, whereas in HBase only the
>> serving region server keeps the data in memory, so there is only one copy
>> of each data element.
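Back-of-envelope arithmetic for that point (made-up numbers, assuming the same
total cache RAM in both clusters): pinning every hot row on all three replicas
divides the distinct data that fits in memory by the replication factor.

    cache_per_server_gb = 16
    servers = 3
    total_cache_gb = cache_per_server_gb * servers   # 48 GB of cache cluster-wide

    print(total_cache_gb // 1)   # one in-memory copy (HBase-style): ~48 GB of distinct rows
    print(total_cache_gb // 3)   # every row pinned on 3 replicas: ~16 GB of distinct rows
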
>> 3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory
>> data. HBase can answer an in-memory read-only query from the single serving
>> region server, while Cassandra (N3/W2/R2) must read from multiple servers.
>>  (Note: N3/W2/R2 still doesn't produce the same consistency situation as
>> HBase, see #1.)
>
> Yes, probably - except that it seems to me Cassandra should be able to offer
> lower latency in the face of java GC pauses. If an HBase RS is in a 200ms GC
> pause, latency for all rows hosted by that server will spike to 200ms. If
> one of three replicas is in a 200ms GC pause, the other two replicas will
> still respond quickly so latency should be less spiky in Cassandra. But it's
> at the cost of more RAM usage as you mentioned above.
>
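A small sketch of that latency argument (made-up numbers): a quorum read
completes when the R-th fastest replica answers, so a single paused replica
only hurts when R equals the replication factor, while a single serving region
server in a pause delays every read of its rows.

    def read_latency_ms(replica_latencies_ms, r):
        # The read returns once r replicas have responded.
        return sorted(replica_latencies_ms)[r - 1]

    print(read_latency_ms([200], r=1))         # lone region server in a 200ms GC pause -> 200
    print(read_latency_ms([200, 3, 4], r=2))   # one of three replicas paused, R=2 -> 4
    print(read_latency_ms([200, 3, 4], r=3))   # R=3 still waits for the paused replica -> 200
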
>>
>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable
>> again in the face of a node failure than HBase/HDFS. Cassandra must repair
>> the keyrange to bring N from 2 back to 3 before it can resume accepting
>> writes with W=3, while HDFS can still achieve a 2-node quorum in the face
>> of a node failure. (Note: using N3/W2 requires R2, see #3; and this still
>> doesn't produce the same consistency situation as HBase, see #1.)
>
> It takes HBase a while to detect the failure and recover the regions to new
> servers - the recovery time depends on the amount of unflushed data in the
> memstores of the failed server. With default configs, it takes 1 minute for
> the ZK lease to be lost on a failed server, and then somewhere between 10
> seconds and a few minutes to fully reassign the regions (depending on the
> amount of data needing to be replayed from the HLog).
>
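A back-of-envelope sketch of that recovery window (the one-minute ZK lease
figure comes from the note above; the replay rate is a made-up assumption, and
tunables such as zookeeper.session.timeout vary by cluster):

    def hbase_recovery_estimate_s(unflushed_mb, zk_timeout_s=60, replay_mb_per_s=20.0):
        # Time until the failed server's regions serve again: lose the ZK
        # lease, then replay the unflushed HLog edits on the new servers.
        return zk_timeout_s + unflushed_mb / replay_mb_per_s

    print(hbase_recovery_estimate_s(200))    # ~70s with 200 MB of unflushed memstore
    print(hbase_recovery_estimate_s(2000))   # ~160s when there is a lot to replay
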
>>
>> 5) HBase can't match the row-availability situation of Cassandra (N3/W2/R2
>> or N3/W3/R1). In the face of a single machine failure, if the failed
>> machine is the region server hosting a region, those keys are offline in
>> HBase until the region is reassigned to a new server and brought online. In
>> Cassandra, no single node failure causes the data to become unavailable.
>
> True. We're working on the concept of "slave regions" which would handle
> some kinds of reads and blind puts during failures.
>
>>
>> Is that summary correct? Am I missing any points? Did I get any facts
>> wrong?
>> Note, I'm NOT attempting to advocate the following changes, but simply
>> trying to understand the design differences....
>> From my uninformed view, it seems that #1 causes the biggest cascade of
>> differences, affecting both #3 and #4. If Cassandra were allowed to do what
>> HBase/HDFS does, namely to specify a repair-consistency requirement, then
>> Cassandra (N3/W2/R2/Repair2) should give the same consistency guarantee as
>> HBase. Further, if Cassandra were allowed to elect one of the copies of the
>> data as 'master', then it could require the master to participate in all
>> quorum writes, allowing reads to be consistent when conducted only through
>> the master. This could be a road to addressing #2/#3/#4. To my eyes, this
>> would make Cassandra capable of operating in a mode that is essentially
>> equivalent to HBase/HDFS. Do those facts seem correct? --- AGAIN, I'm not
>> advocating that Cassandra make these changes; I'm simply trying to
>> understand the differences by considering what changes it would take to
>> make them the same.
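A minimal sketch of that hypothetical "elected master replica" mode (plain
Python; this illustrates the proposal only, not anything Cassandra actually
implements): every write must be acknowledged by the master plus a quorum, and
reads go only to the master, so a single-read path still observes every
acknowledged write.

    class QuorumError(Exception):
        pass

    class MasteredReplicaSet:
        def __init__(self, nodes, master):
            self.data = {n: None for n in nodes}
            self.master = master

        def write(self, value, alive, w=2):
            reachable = [n for n in self.data if n in alive]
            if self.master not in reachable or len(reachable) < w:
                # Unlike plain Cassandra quorums, nothing is applied on failure.
                raise QuorumError("need the master plus W=%d reachable replicas" % w)
            for n in reachable:
                self.data[n] = value

        def read(self):
            # The master took part in every acknowledged write, so reading it
            # alone sees all acknowledged data.
            return self.data[self.master]

    rs = MasteredReplicaSet(["a", "b", "c"], master="a")
    rs.write("B", alive={"a", "b"}, w=2)
    print(rs.read())   # -> B
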
>
> On the surface it seems reasonable, but these things are always very tricky,
> and it's the subtleties that will kill you :)
> -Todd
> P.S. Very happy to see informed technical discussion of the differences in
> the two architectures!




On Mon, Nov 22, 2010 at 1:06 PM, David Jeske <dav...@gmail.com> wrote:
>
> I already noticed a mistake in my own facts...
> On Mon, Nov 22, 2010 at 10:01 AM, David Jeske <dav...@gmail.com> wrote:
>>
>> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable
>> again in the face of a node failure than HBase/HDFS. Cassandra must repair
>> the keyrange to bring N from 2 back to 3 before it can resume accepting
>> writes with W=3, while HDFS can still achieve a 2-node quorum in the face
>> of a node failure. (Note: using N3/W2 requires R2, see #3; and this still
>> doesn't produce the same consistency situation as HBase, see #1.)
>
> This should read:
> 4) Cassandra (N3/W3/R1) takes longer to allow data to become writable again
> in the face of a node failure than HBase/HDFS does in the face of an HDFS
> node failure. Cassandra must repair the keyrange to bring N from 2 back to 3
> before it can resume accepting writes with W=3, while HDFS can still achieve
> a 2-node quorum in the face of a node failure. (Note: using N3/W2 requires
> R2, see #3; and this still doesn't produce the same consistency situation as
> HBase, see #1.)

2) Cassandra has a less efficient memory footprint for data pinned in
memory (or cached). With 3 replicas in Cassandra, each element of data
pinned in memory is kept in memory on 3 servers, whereas in HBase only
the serving region server keeps the data in memory, so there is only one
copy of each data element.

True in the general case, but if you want to optimize for READ.ONE, we
can make our caches work like HBase's:
https://issues.apache.org/jira/browse/CASSANDRA-1314

#4 is not a fair comparison. If HBase is happy working with 2 of 3 replicas,
that is not equivalent to Write.ALL/Read.ONE.


