RE: Why is scaling HBase much simpler than scaling a relational db?

Jonathan Gray Thu, 21 Aug 2008 09:28:21 -0700

A few very big differences...

- HBase/BigTable don't have "transactions" in the same way that a relational 
database does.  While transactions are possible (and were just recently 
implemented for HBase, see HBASE-669), they are not at the core of the design.  
A major bottleneck of distributed multi-master relational databases is 
distributed transactions/locks.

- There's a very big difference between storage of relational/row-oriented 
databases and column-oriented databases.  For example, if I have a table of 
'users' and I need to store friendships between these users... In a relational 
database my design is something like:

Table: users(pkey = userid)
Table: friendships(userid,friendid,...) which contains one (or maybe two 
depending on how it's implemented) row for each friendship.

In order to look up a given user's friends: SELECT * FROM friendships WHERE 
userid = 'myid';

This query would use an index on the friendships table to retrieve all the 
necessary rows.  Depending on the relational database you might also be 
fetching each and every row (entirely) off of disk to be read.  In a sharded 
relational database, this would require hitting every node to get whichever 
friendships were stored on that node.  There's lots of room for optimizations 
here but any way you slice it, you're likely pulling non-sequential blocks off 
disk.  When you add in the overhead of ACID transactions this can get slow.
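
To make the relational side concrete, here is a minimal sketch of the two 
tables and the lookup query.  It uses Python's built-in sqlite3 module purely 
as a stand-in for a relational database; the column types, the "since" 
attribute, and the sample ids are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the relational database (an assumption;
# the discussion above is not about any specific RDBMS).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid TEXT PRIMARY KEY);
    CREATE TABLE friendships (
        userid   TEXT NOT NULL,
        friendid TEXT NOT NULL,
        since    TEXT,               -- hypothetical extra attribute
        PRIMARY KEY (userid, friendid)
    );
""")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("myid",), ("alice",), ("bob",)])
conn.executemany(
    "INSERT INTO friendships VALUES (?, ?, ?)",
    [("myid", "alice", "2008-01-01"),
     ("myid", "bob", "2008-02-01"),
     ("alice", "myid", "2008-01-01")],
)

def friends_of(userid):
    """Return the friend ids of one user; one row is read per friendship."""
    rows = conn.execute(
        "SELECT friendid FROM friendships WHERE userid = ? ORDER BY friendid",
        (userid,),
    )
    return [row[0] for row in rows]

print(friends_of("myid"))
```

On a single node the primary-key index keeps this cheap.  The cost described 
above shows up once the friendships rows are spread over several machines.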

The cost of this relational query continues to increase as a user adds more 
friends.  You also begin to hit practical limits.  If I have millions of 
users, each with many thousands of potential friends, these indexes grow 
enormous and things get nasty quickly.  The same applies if, rather than 
friendships, I'm storing activity logs of actions taken by users.

In a column-oriented database these things scale continuously with minimal 
difference between 10 users and 10,000,000 users, 10 friendships and 10,000 
friendships.

Rather than a friendships table, you could just have a friendships column 
family in the users table.  Each column in that family would contain the ID of 
a friend.  The value could store anything else you would have stored in the 
friendships table in the relational model.  As column families are stored 
together/sequentially on a per-row basis, reading a user with 1 friend versus a 
user with 10,000 friends is virtually the same.  The biggest difference is just 
in the shipping of this information across the network which is unavoidable.  
In this system a user could have 10,000,000 friends.  In a relational database 
the size of the friendship table would grow massively and the indexes would be 
out of control.
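
The layout described above can be sketched with plain nested dictionaries.  
This is only a toy in-memory model (row key -> column family -> column name -> 
value), not the HBase client API; the function names and the "info" family are 
hypothetical.

```python
from collections import defaultdict

# Toy model of a wide-column table:
#   row key -> column family -> column name -> value
# This is NOT the HBase client API, only a sketch of the data layout.
users = defaultdict(lambda: defaultdict(dict))

def add_friend(userid, friendid, details=""):
    # The friend's ID is the column name; the value carries whatever the
    # relational friendships row would have stored.
    users[userid]["friendships"][friendid] = details

def friends_of(userid):
    # A single row lookup returns the whole family, whether it holds
    # one column or many thousands.
    return sorted(users.get(userid, {}).get("friendships", {}))

add_friend("myid", "alice", "since=2008-01-01")
add_friend("myid", "bob", "since=2008-02-01")
users["myid"]["info"]["name"] = "My Name"  # hypothetical second family

print(friends_of("myid"))
```

Everything about one user lives under one row key, so a read never has to 
gather rows from a separate table.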

It's certainly possible to make relational databases "scale".  That usually 
means massive optimizations, manual sharding, being very clever about how you 
query things, and often de-normalizing.  Index bloat and table bloat can 
thrash a relational db.

In HBase, de-normalizing is usually a good thing.  Storage space is often 
considered free (there is no large cost associated with storing something in 
multiple places).  In a relational database this is not so much the case.

In HBase, the sharding and distribution come for free out of the box.  With a 
relational database you must jump through hoops and often end up implementing 
your own sharding/distributed hashing algorithms so you can distribute across 
machines.
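
As a rough idea of what "implementing your own sharding" means, here is a 
minimal sketch of hash-based shard selection.  The shard names and helper 
functions are hypothetical, and real setups usually need more than this 
(rebalancing, replication, cross-shard queries).

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names

def shard_for(userid, shards=SHARDS):
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.md5(userid.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

def count_moved(keys, old_shards, new_shards):
    """Count keys whose shard changes when the shard list changes."""
    return sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )

keys = [f"user{i}" for i in range(1000)]
moved = count_moved(keys, SHARDS, SHARDS + ["db4"])
print(f"{moved} of {len(keys)} keys change shard after adding one node")
```

With plain modulo hashing, adding a single node is expected to remap most 
keys, which is one reason hand-rolled sharding gets complicated quickly.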

Yes, you lose the relational primitives.  Sometimes you wish you could do a 
simple join.  But if you get in the right mindset, you learn how to put your 
data into this new data model.  And in the end the payoffs are huge.

In addition to all this, with HBase/Hadoop you get MapReduce.  It's feasible to 
implement something like that on top of a distributed relational database but 
again the complexity is enormous.  With HBase/Hadoop it's a built-in part of 
the system, a system which is very intelligent at keeping logic close to data, 
etc.
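
For readers unfamiliar with the model, here is a minimal single-process sketch 
of the map/shuffle/reduce pattern, counting actions per user in an activity 
log.  It is not Hadoop's API; the function names and sample records are made 
up, and a real job would run the map and reduce steps on the nodes that hold 
the data.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; here, one count per logged action.
    userid, _action = record
    yield userid, 1

def reduce_phase(key, values):
    # Combine all values that were emitted for one key.
    return key, sum(values)

def run_mapreduce(records):
    # "Shuffle": group the mapped values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())

logs = [("myid", "login"), ("alice", "post"), ("myid", "post")]
print(run_mapreduce(logs))
```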

To directly answer your question, it's "simpler" to scale HBase because the 
scaling comes for free out of the box.  You get automatic sharding/distribution 
of storage and queries.  There's nothing simpler than that.  Distributing a 
relational database is never simple.

I hope this starts to shed some light on what the differences are.

Jonathan Gray
Streamy Inc.

-----Original Message-----
From: Mork0075 [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 21, 2008 8:48 AM
To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org
Subject: Re: Why is scaling HBase much simpler then scaling a relational db?

Thank you, but I still don't get it.

I've read tons of websites and papers, but there's no clear and well-founded 
answer to "why use BigTable instead of relational databases".

MySQL Cluster seems to offer the same scalability and level of 
abstraction, without switching to a non-relational paradigm. Lots of 
blog posts are highly emotional, without answering the core question:

"Why don't RDBMSs scale, and why does something like BigTable?" Often you 
read something like this:

"They have also built a system called BigTable, which is a Column 
Oriented Database, which splits a table into columns rather than rows 
making is much simpler to distribute and parallelize."

Why?

Really confusing ... ;)

Stuart Sierra wrote:
> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote:
>> Can you please explain, why someone should use HBase for horizontal
>> scaling instead of a relational database? One reason for me would be,
>> that I don't have to implement the sharding logic myself. Are there others?
> 
> A slight tangent -- there are various tools that implement sharding
> over relational databases like MySQL.  Two that I know of are
> DBSlayer,
> http://code.nytimes.com/projects/dbslayer
> and MySQL Proxy,
> http://forge.mysql.com/wiki/MySQL_Proxy
> 
> I don't know of any formal comparisons between sharding traditional
> database servers and distributed databases like HBase.
> -Stuart
>
