Re: Cassandra vs HBase

Andrew Purtell Wed, 02 Sep 2009 11:34:51 -0700

Hi Sylvain,

I said this last week on another thread, and I feel it is quite apt here in 
response to you:

"HBase, like other projects in this area, is at an early stage of
development. These projects cover the use cases of their creators, but
they are not yet answers to the larger set of problems; that space is
untapped and only waiting for creativity and effort. I think I can speak
for HBase in particular: we welcome this and would be pleased to assist
at every opportunity."

Best regards,

    - Andy

________________________________
From: Sylvain Hellegouarch <[email protected]>
To: [email protected]
Sent: Wednesday, September 2, 2009 10:15:41 AM
Subject: Re: Cassandra vs HBase

I must admit, I'm left as puzzled as you are. Our current use case at 
work involves writing a large number of small event log entries. HDFS was 
quickly out of the question, since it does not yet support appending to a 
file or, more generally, handle a large number of small write operations well.

So we went with HBase because we trust the Hadoop/HBase 
infrastructure will offer us the robustness and reliability we need. 
That being said, I'm not at ease regarding the capacity of 
HBase to handle the potential load we are looking at ingesting.

In fact, it's a common trait of such systems: they've been designed with 
a certain use case in mind, and sometimes I feel like their design and 
implementation leak way too much into our infrastructure, leading us down 
the path of a virtual lock-in.

Now I am not accusing anyone here, just observing that I find it really 
hard to find any production story about those systems for a use case 
similar to the one we have at hand.

The number of nodes this or that company has doesn't quite interest me 
as much as the way they are actually using HBase and Hadoop.

RDBMS don't scale as well but they've got a long history and people do 
know how to optimise, use and manage them. It seems column-oriented 
database systems are still young :)

- Sylvain

Schubert Zhang wrote:
> Setting Cassandra aside, I want to discuss some questions about
> HBase/Bigtable.  Any advice is welcome.
>
> Regarding running MapReduce to scan/analyze big data in HBase:
>
> Compared to sequentially reading data from HDFS files directly,
> scanning/sequentially reading data from HBase is slower (in my tests, by at
> least 3:1 or 4:1).
>
> For data in HBase, it is difficult to analyze only a specified part of the
> data. For example, it is difficult to analyze only the most recent day of
> data. In my application, I am considering partitioning data into different
> HBase tables (e.g. one table per day), so that I only need to touch one
> table when analyzing via MapReduce.
> In Google's Bigtable paper, in section "8.1 Google Analytics", they also
> describe this usage, but I don't know how they do it.
>
> It is also slower to load a flood of data into an HBase table than to write
> it to files (in my tests, again by at least 3:1 or 4:1). So maybe in the
> future HBase can provide a bulk-load feature, like PNUTS?
>
> Many people suggest that we store only metadata in HBase tables and leave
> the data in HDFS files, because our time-series dataset is very big.  I
> understand this idea makes sense for some simple application requirements.
> But usually, I want different indexes on the raw data. It is difficult to
> build such indexes if the raw data files (which are raw or are
> reconstructed via MapReduce periodically on recent data) are not totally
> sorted.  .... HBase can provide us with many of the features we expect:
> sorted, distributed b-tree, compact/merge.
>
> So, it is very difficult for me to make the trade-off.
> If I store the data in HDFS files (maybe partitioned) and the metadata/index
> in HBase, the metadata/index is very difficult to build.
> If I rely on HBase totally, the performance of ingesting and scanning data
> is not good. Is it reasonable to do MapReduce on HBase? We know the goal of
> HBase is to provide random access over HDFS, and it is an extension of or
> adaptor over HDFS.
>
> ----
> I often think that maybe we need a data storage engine that does not
> need such strong consistency and can provide better write and read
> throughput, like HDFS. Maybe we can design another system, like a
> simpler HBase?
>
> Schubert
>
> On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]> wrote:
>
>  
>> To be precise, S3. http://status.aws.amazon.com/s3-20080720.html
>>
>>   - Andy
>>
>>
>>
>>
>> ________________________________
>> From: Andrew Purtell <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, September 1, 2009 5:53:09 PM
>> Subject: Re: Cassandra vs HBase
>>
>>
>> Right... I recall an incident in AWS where a malformed gossip packet took
>> down all of Dynamo. Seems that even P2P doesn't mitigate against corner
>> cases.
>>
>>
>> On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]> wrote:
>>
>>    
>>> The big win for Cassandra is that its p2p distribution model -- which
>>> drives the consistency model -- means there is no single point of
>>> failure.  SPF can be mitigated by failover but it's really, really
>>> hard to get all the corner cases right with that approach.  Even
>>> Google with their 3 year head start and huge engineering resources
>>> still has trouble with that occasionally.  (See e.g.
>>> http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
>>>      
>>
>>
>>    
>
>
