This answer is very, very good!

You example with the friends makes perfectly sense. Can you imagine a scenario where storing the data in column oriented instead of row oriented db (so if you will an counterexample) causes such a huge performance mismatch, like the friends one in row/column comparison?

Can you please provide an example of "good de-normalization" in HBase and how its held consitent (in your friends example in a relational db, there would be a cascadingDelete)? As i think of the users table: if i delete an user with the userid='123', then if have to walk through all of the other users column-family "friends" to guranty consitency?! Is de-normalization in HBase only used to avoid joins? Our webapp doenst use joins at the moment anyway.


Jonathan Gray schrieb:
A few very big differences...

- HBase/BigTable don't have "transactions" in the same way that a relational 
database does.  While it is possible (and was just recently implemented for HBase, see 
HBASE-669) it is not at the core of this design.  A major bottleneck of distributed 
multi-master relational databases is distributed transactions/locks.

- There's a very big difference between storage of relational/row-oriented 
databases and column-oriented databases.  For example, if I have a table of 
'users' and I need to store friendships between these users... In a relational 
database my design is something like:

Table: users(pkey = userid)
Table: friendships(userid,friendid,...) which contains one (or maybe two 
depending on how it's impelemented) row for each friendship.

In order to lookup a given users friend, SELECT * FROM friendships WHERE userid 
= 'myid';

This query would use an index on the friendships table to retrieve all the 
necessary rows.  Depending on the relational database you might also be 
fetching each and every row (entirely) off of disk to be read.  In a sharded 
relational database, this would require hitting every node to get whichever 
friendships were stored on that node.  There's lots of room for optimizations 
here but any way you slice it, you're likely pulling non-sequential blocks off 
disk.  When you add in the overhead of ACID transactions this can get slow.

The cost of this relational query continues to increase as a user adds more 
friends.  You also begin to have practical limits.  If I have millions of 
users, each with many thousands of potential friends, the size of these indexes 
grow exponentially and things get nasty quickly.  Rather than friendships, 
imagine I'm storing activity logs of actions taken by users.

In a column-oriented database these things scale continuously with minimal 
difference between 10 users and 10,000,000 users, 10 friendships and 10,000 
friendships.

Rather than a friendships table, you could just have a friendships column 
family in the users table.  Each column in that family would contain the ID of 
a friend.  The value could store anything else you would have stored in the 
friendships table in the relational model.  As column families are stored 
together/sequentially on a per-row basis, reading a user with 1 friend versus a 
user with 10,000 friends is virtually the same.  The biggest difference is just 
in the shipping of this information across the network which is unavoidable.  
In this system a user could have 10,000,000 friends.  In a relational database 
the size of the friendship table would grow massively and the indexes would be 
out of control.


It's certainly possible to make relational databases "scale".  What that is 
about is usually massive optimizations, manual sharding, being very clever about how you 
query things, and often de-normalizing.  Index bloat and table bloat can thrash a 
relational db.

In HBase, de-normalizing is usually a good thing.  Storage space is often 
considered free (not a large cost associated with storing something in multiple 
places).  In a relational database this is not so much the case.

In HBase, the sharding and distribution come for free out of the box.  With a 
relational database you must jump through hoops and often end up implementing 
your own sharding/distributed hashing algorithms so you can distribute across 
machines.

Yes, you lose the relational primitives.  Sometimes you wish you could do a 
simple join.  But if you get in the right mindset, you learn how to put your 
data into this new data model.  And in the end the payoffs are huge.


In addition to all this, with HBase/Hadoop you get MapReduce.  It's feasible to 
implement something like that on top of a distributed relational database but 
again the complexity is enormous.  With HBase/Hadoop it's a built-in part of 
the system, a system which is very intelligent at keeping logic close to data, 
etc.


To directly answer your question, it's "simpler" to scale HBase because the 
scaling comes for free out of the box.  You get automatic sharding/distribution of 
storage and queries.  There's nothing simpler than that.  Distributing a relational 
database is never simple.

I hope this starts to shed some light on what the differences are.


Jonathan Gray
Streamy Inc.

-----Original Message-----
From: Mork0075 [mailto:[EMAIL PROTECTED] Sent: Thursday, August 21, 2008 8:48 AM
To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org
Subject: Re: Why is scaling HBase much simpler then scaling a relational db?

Thank you, but i still don't got it.

I've read tons of websites and papers, but there's no clear und founded answer "why use BigTable instead of relational databases".

MySQL Cluster seams to offer the same scalabilty and level of abstraction, whithout switching to a non relational pardigm. Lots of blog posts are highly emotional, without answering the core question:

"Why RDBMS don't scale and why something like BigTable do". Often you read something like this:

"They have also built a system called BigTable, which is a Column Oriented Database, which splits a table into columns rather than rows making is much simpler to distribute and parallelize."

Why?

Really confusing ... ;)

Stuart Sierra schrieb:
On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote:
Can you please explain, why someone should use HBase for horizontal
scaling instead of a relational database? One reason for me would be,
that i don't have to implement the sharding logic myself. Are there other?
A slight tangent -- there are various tools that implement sharding
over relational databases like MySQL.  Two that I know of are
DBSlayer,
http://code.nytimes.com/projects/dbslayer
and MySQL Proxy,
http://forge.mysql.com/wiki/MySQL_Proxy

I don't know of any formal comparisons between sharding traditional
database servers and distributed databases like HBase.
-Stuart




Reply via email to