Hey there,

Expect a post on my blog in a few days on this subject, with performance and feature comparisons :)
Cheers,
Bradford

On Wed, Sep 2, 2009 at 11:47 AM, stack <[email protected]> wrote:
> On Wed, Sep 2, 2009 at 9:53 AM, Schubert Zhang <[email protected]> wrote:
>
>> Regardless of Cassandra, I want to discuss some questions about
>> HBase/Bigtable. Any advice is welcome.
>>
>> Regarding running MapReduce to scan/analyze big data in HBase:
>>
>> Compared to sequentially reading data from HDFS files directly,
>> scanning/sequentially reading data from HBase is slower (in my tests, at
>> least 3:1 or 4:1).
>>
> Is it really 3 to 4 times slower? My guess is that it varies with data
> sizes but, a while back, I compared hbase reading and raw reading from
> mapfiles with no hbase in between. True, I was seeing that mapfiles were
> almost 3x faster when scanning, but on other dimensions hbase was close to
> raw mapfile speeds.
>
> We have some ideas for improving on our current speeds (e.g. we always use
> dfs pread when getting blocks from hdfs -- we should switch away from pread
> when we figure the access pattern is a scan). These are in the works.
>
>> For the data in HBase, it is difficult to analyze only a specified part of
>> the data. For example, it is difficult to analyze only the most recent day
>> of data. In my application, I am considering partitioning data into
>> different HBase tables (e.g. one day - one table); then I only have to
>> touch one table for analysis via MapReduce.
>> In Google's Bigtable paper, in the "8.1 Google Analytics" section, they
>> also describe this usage, but I don't know how they do it.
>>
> Time-keyed row keys are a bit tough. What about adding to the tail of a
> continuing table, or does the data come in too fast? If you could add to
> the end of your table, could you MR against the tail only? Can you use the
> version dimension of hbase? It would be a full table scan, but it would
> all be server-side, so it should be fast.
>
>> It is also slower to put flooding data into an HBase table than to write
>> to files (in my tests, at least 3:1 or 4:1 too). So maybe in the future
>> HBase can provide a bulk-load feature, like PNUTS?
>>
> In my tests -- see up in the wiki -- for sequential write, it was less than
> 2x (you can't random-write into a mapfile).
>
> A first cut exists in hbase-48. There is a reducer which sorts all keys on
> a row and an hfile output format that writes a single file into a region.
> Absent is the necessary partitioner to ensure global sort order. A generic
> sorter is not possible since the partitioner needs to have knowledge of
> your key space. Try it and let us know. Currently it works populating a
> new table only. It shouldn't be hard to rig it to populate an extant table,
> but I'm not going to work on it unless there is interest from others.
>
>> Many a time I think that maybe we need a data storage engine which does
>> not need such strong consistency, and which can provide better write and
>> read throughput, like HDFS. Maybe we can design another system, like a
>> simpler HBase?
>>
> You think it's the consistency that costs? HBase is a pretty
> straightforward system as is. How would you simplify it, Schubert? We can
> work on improving the performance to cut down on those 4Xs and 3Xs that
> you are seeing. A schema for time-series is a bit tough, though, if you
> want to key it by timestamp. You could try Cassandra and let it hash your
> keys so they get distributed around the cluster, but my guess is that the
> scan would be slow if you need to access the content in order?
>
> Thanks,
> St.Ack

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
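
For concreteness, here is a minimal sketch of the "use the version dimension" idea stack mentions above: a map-only job that scans the whole table but asks the region servers to return only cells stamped within the last day, via Scan.setTimeRange. The table name ("events"), the column family ("metrics"), and the assumption that cells were written with event-time timestamps rather than write time are all made up for illustration; this is not code from the thread or from hbase-48.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LastDayScan {

  // Stand-in analysis: just count the cells that fall inside the time range.
  static class DayMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.getCounter("lastday", "cells").increment(value.size());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-last-day");
    job.setJarByClass(LastDayScan.class);

    long now = System.currentTimeMillis();
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("metrics"));            // hypothetical column family
    scan.setTimeRange(now - 24L * 60 * 60 * 1000, now);  // only cells stamped in the last day
    scan.setCaching(500);                                 // fewer RPC round trips while scanning
    scan.setCacheBlocks(false);                           // don't churn the block cache from MR

    TableMapReduceUtil.initTableMapperJob(
        "events",                                         // hypothetical table name
        scan, DayMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);     // counters only, no file output

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether this beats one-table-per-day depends on how much of the table falls outside the range: the filtering happens server-side, but it is still a full table scan, as stack notes.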

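On the bulk-load side, the missing piece stack describes is a partitioner that knows your key space, so that reducer output comes out globally sorted and each reducer's hfile lands in exactly one region. Below is a rough, hypothetical sketch of that kind of partitioner; the split points are invented, and in practice they would come from the target table's region boundaries or from sampling the incoming keys. Again, this is illustrative, not the hbase-48 code.

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Partitioner;

public class RegionBoundaryPartitioner<V>
    extends Partitioner<ImmutableBytesWritable, V> {

  // Hard-coded example split points; real ones must reflect your key space.
  private static final byte[][] SPLITS = {
      Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
  };

  @Override
  public int getPartition(ImmutableBytesWritable key, V value, int numPartitions) {
    // Row keys below SPLITS[0] go to reducer 0, below SPLITS[1] to reducer 1,
    // and so on; each reducer therefore sees one contiguous, ordered key range.
    for (int i = 0; i < SPLITS.length && i < numPartitions - 1; i++) {
      if (Bytes.compareTo(key.get(), key.getOffset(), key.getLength(),
                          SPLITS[i], 0, SPLITS[i].length) < 0) {
        return i;
      }
    }
    return Math.min(SPLITS.length, numPartitions - 1);
  }
}

Hadoop's TotalOrderPartitioner driven by a sampled partition file does the same job generically, but someone still has to supply split points that match the data, which is essentially why stack says a one-size-fits-all sorter can't ship with hbase-48 itself.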