Re: Why is scaling HBase much simpler then scaling a relational db?

Mork0075 Wed, 27 Aug 2008 00:57:53 -0700

I'am still really interested in these three questions :)

> You example with the friends makes perfectly sense. Can you imagine a
> scenario where storing the data in column oriented instead of row
> oriented db (so if you will an counterexample) causes such a huge
> performance mismatch, like the friends one in row/column comparison?



> Can you please provide an example of "good de-normalization" in HBase
> and how its held consitent (in your friends example in a relational db,
> there would be a cascadingDelete)? As i think of the users table: if i
> delete an user with the userid='123', then if have to walk through all
> of the other users column-family "friends" to guranty consitency?! Is
> de-normalization in HBase only used to avoid joins? Our webapp doenst
> use joins at the moment anyway.

> As you describe it, its a problem of implementation. BigTable is
> designed to scale, there are routines to shard the data, desitribute
> it to the pool of connected servers. Could MySQL perhaps decide
> tomorrow to implement something similar or does the relational model
> avoids this?


> 
> Jonathan Gray schrieb:
>> A few very big differences...
>>
>> - HBase/BigTable don't have "transactions" in the same way that a
>> relational database does.  While it is possible (and was just recently
>> implemented for HBase, see HBASE-669) it is not at the core of this
>> design.  A major bottleneck of distributed multi-master relational
>> databases is distributed transactions/locks.
>>
>> - There's a very big difference between storage of
>> relational/row-oriented databases and column-oriented databases.  For
>> example, if I have a table of 'users' and I need to store friendships
>> between these users... In a relational database my design is something
>> like:
>>
>> Table: users(pkey = userid)
>> Table: friendships(userid,friendid,...) which contains one (or maybe
>> two depending on how it's impelemented) row for each friendship.
>>
>> In order to lookup a given users friend, SELECT * FROM friendships
>> WHERE userid = 'myid';
>>
>> This query would use an index on the friendships table to retrieve all
>> the necessary rows.  Depending on the relational database you might
>> also be fetching each and every row (entirely) off of disk to be
>> read.  In a sharded relational database, this would require hitting
>> every node to get whichever friendships were stored on that node. 
>> There's lots of room for optimizations here but any way you slice it,
>> you're likely pulling non-sequential blocks off disk.  When you add in
>> the overhead of ACID transactions this can get slow.
>>
>> The cost of this relational query continues to increase as a user adds
>> more friends.  You also begin to have practical limits.  If I have
>> millions of users, each with many thousands of potential friends, the
>> size of these indexes grow exponentially and things get nasty
>> quickly.  Rather than friendships, imagine I'm storing activity logs
>> of actions taken by users.
>>
>> In a column-oriented database these things scale continuously with
>> minimal difference between 10 users and 10,000,000 users, 10
>> friendships and 10,000 friendships.
>>
>> Rather than a friendships table, you could just have a friendships
>> column family in the users table.  Each column in that family would
>> contain the ID of a friend.  The value could store anything else you
>> would have stored in the friendships table in the relational model. 
>> As column families are stored together/sequentially on a per-row
>> basis, reading a user with 1 friend versus a user with 10,000 friends
>> is virtually the same.  The biggest difference is just in the shipping
>> of this information across the network which is unavoidable.  In this
>> system a user could have 10,000,000 friends.  In a relational database
>> the size of the friendship table would grow massively and the indexes
>> would be out of control.
>>
>>
>> It's certainly possible to make relational databases "scale".  What
>> that is about is usually massive optimizations, manual sharding, being
>> very clever about how you query things, and often de-normalizing. 
>> Index bloat and table bloat can thrash a relational db.
>>
>> In HBase, de-normalizing is usually a good thing.  Storage space is
>> often considered free (not a large cost associated with storing
>> something in multiple places).  In a relational database this is not
>> so much the case.
>>
>> In HBase, the sharding and distribution come for free out of the box. 
>> With a relational database you must jump through hoops and often end
>> up implementing your own sharding/distributed hashing algorithms so
>> you can distribute across machines.
>>
>> Yes, you lose the relational primitives.  Sometimes you wish you could
>> do a simple join.  But if you get in the right mindset, you learn how
>> to put your data into this new data model.  And in the end the payoffs
>> are huge.
>>
>>
>> In addition to all this, with HBase/Hadoop you get MapReduce.  It's
>> feasible to implement something like that on top of a distributed
>> relational database but again the complexity is enormous.  With
>> HBase/Hadoop it's a built-in part of the system, a system which is
>> very intelligent at keeping logic close to data, etc.
>>
>>
>> To directly answer your question, it's "simpler" to scale HBase
>> because the scaling comes for free out of the box.  You get automatic
>> sharding/distribution of storage and queries.  There's nothing simpler
>> than that.  Distributing a relational database is never simple.
>>
>> I hope this starts to shed some light on what the differences are.
>>
>>
>> Jonathan Gray
>> Streamy Inc.
>>
>> -----Original Message-----
>> From: Mork0075 [mailto:[EMAIL PROTECTED] Sent: Thursday, August
>> 21, 2008 8:48 AM
>> To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org
>> Subject: Re: Why is scaling HBase much simpler then scaling a
>> relational db?
>>
>> Thank you, but i still don't got it.
>>
>> I've read tons of websites and papers, but there's no clear und
>> founded answer "why use BigTable instead of relational databases".
>>
>> MySQL Cluster seams to offer the same scalabilty and level of
>> abstraction, whithout switching to a non relational pardigm. Lots of
>> blog posts are highly emotional, without answering the core question:
>>
>> "Why RDBMS don't scale and why something like BigTable do". Often you
>> read something like this:
>>
>> "They have also built a system called BigTable, which is a Column
>> Oriented Database, which splits a table into columns rather than rows
>> making is much simpler to distribute and parallelize."
>>
>> Why?
>>
>> Really confusing ... ;)
>>
>> Stuart Sierra schrieb:
>>> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]>
>>> wrote:
>>>> Can you please explain, why someone should use HBase for horizontal
>>>> scaling instead of a relational database? One reason for me would be,
>>>> that i don't have to implement the sharding logic myself. Are there
>>>> other?
>>> A slight tangent -- there are various tools that implement sharding
>>> over relational databases like MySQL.  Two that I know of are
>>> DBSlayer,
>>> http://code.nytimes.com/projects/dbslayer
>>> and MySQL Proxy,
>>> http://forge.mysql.com/wiki/MySQL_Proxy
>>>
>>> I don't know of any formal comparisons between sharding traditional
>>> database servers and distributed databases like HBase.
>>> -Stuart
>>>
>>
>>
> 
>

Re: Why is scaling HBase much simpler then scaling a relational db?

Reply via email to