This answer is very, very good!
You example with the friends makes perfectly sense. Can you imagine a
scenario where storing the data in column oriented instead of row
oriented db (so if you will an counterexample) causes such a huge
performance mismatch, like the friends one in row/column comparison?
Can you please provide an example of "good de-normalization" in HBase
and how its held consitent (in your friends example in a relational db,
there would be a cascadingDelete)? As i think of the users table: if i
delete an user with the userid='123', then if have to walk through all
of the other users column-family "friends" to guranty consitency?! Is
de-normalization in HBase only used to avoid joins? Our webapp doenst
use joins at the moment anyway.
Jonathan Gray schrieb:
A few very big differences...
- HBase/BigTable don't have "transactions" in the same way that a relational
database does. While it is possible (and was just recently implemented for HBase, see
HBASE-669) it is not at the core of this design. A major bottleneck of distributed
multi-master relational databases is distributed transactions/locks.
- There's a very big difference between storage of relational/row-oriented
databases and column-oriented databases. For example, if I have a table of
'users' and I need to store friendships between these users... In a relational
database my design is something like:
Table: users(pkey = userid)
Table: friendships(userid,friendid,...) which contains one (or maybe two
depending on how it's impelemented) row for each friendship.
In order to lookup a given users friend, SELECT * FROM friendships WHERE userid
= 'myid';
This query would use an index on the friendships table to retrieve all the
necessary rows. Depending on the relational database you might also be
fetching each and every row (entirely) off of disk to be read. In a sharded
relational database, this would require hitting every node to get whichever
friendships were stored on that node. There's lots of room for optimizations
here but any way you slice it, you're likely pulling non-sequential blocks off
disk. When you add in the overhead of ACID transactions this can get slow.
The cost of this relational query continues to increase as a user adds more
friends. You also begin to have practical limits. If I have millions of
users, each with many thousands of potential friends, the size of these indexes
grow exponentially and things get nasty quickly. Rather than friendships,
imagine I'm storing activity logs of actions taken by users.
In a column-oriented database these things scale continuously with minimal
difference between 10 users and 10,000,000 users, 10 friendships and 10,000
friendships.
Rather than a friendships table, you could just have a friendships column
family in the users table. Each column in that family would contain the ID of
a friend. The value could store anything else you would have stored in the
friendships table in the relational model. As column families are stored
together/sequentially on a per-row basis, reading a user with 1 friend versus a
user with 10,000 friends is virtually the same. The biggest difference is just
in the shipping of this information across the network which is unavoidable.
In this system a user could have 10,000,000 friends. In a relational database
the size of the friendship table would grow massively and the indexes would be
out of control.
It's certainly possible to make relational databases "scale". What that is
about is usually massive optimizations, manual sharding, being very clever about how you
query things, and often de-normalizing. Index bloat and table bloat can thrash a
relational db.
In HBase, de-normalizing is usually a good thing. Storage space is often
considered free (not a large cost associated with storing something in multiple
places). In a relational database this is not so much the case.
In HBase, the sharding and distribution come for free out of the box. With a
relational database you must jump through hoops and often end up implementing
your own sharding/distributed hashing algorithms so you can distribute across
machines.
Yes, you lose the relational primitives. Sometimes you wish you could do a
simple join. But if you get in the right mindset, you learn how to put your
data into this new data model. And in the end the payoffs are huge.
In addition to all this, with HBase/Hadoop you get MapReduce. It's feasible to
implement something like that on top of a distributed relational database but
again the complexity is enormous. With HBase/Hadoop it's a built-in part of
the system, a system which is very intelligent at keeping logic close to data,
etc.
To directly answer your question, it's "simpler" to scale HBase because the
scaling comes for free out of the box. You get automatic sharding/distribution of
storage and queries. There's nothing simpler than that. Distributing a relational
database is never simple.
I hope this starts to shed some light on what the differences are.
Jonathan Gray
Streamy Inc.
-----Original Message-----
From: Mork0075 [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 21, 2008 8:48 AM
To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org
Subject: Re: Why is scaling HBase much simpler then scaling a relational db?
Thank you, but i still don't got it.
I've read tons of websites and papers, but there's no clear und founded
answer "why use BigTable instead of relational databases".
MySQL Cluster seams to offer the same scalabilty and level of
abstraction, whithout switching to a non relational pardigm. Lots of
blog posts are highly emotional, without answering the core question:
"Why RDBMS don't scale and why something like BigTable do". Often you
read something like this:
"They have also built a system called BigTable, which is a Column
Oriented Database, which splits a table into columns rather than rows
making is much simpler to distribute and parallelize."
Why?
Really confusing ... ;)
Stuart Sierra schrieb:
On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote:
Can you please explain, why someone should use HBase for horizontal
scaling instead of a relational database? One reason for me would be,
that i don't have to implement the sharding logic myself. Are there other?
A slight tangent -- there are various tools that implement sharding
over relational databases like MySQL. Two that I know of are
DBSlayer,
http://code.nytimes.com/projects/dbslayer
and MySQL Proxy,
http://forge.mysql.com/wiki/MySQL_Proxy
I don't know of any formal comparisons between sharding traditional
database servers and distributed databases like HBase.
-Stuart