A few very big differences... - HBase/BigTable don't have "transactions" in the same way that a relational database does. While it is possible (and was just recently implemented for HBase, see HBASE-669) it is not at the core of this design. A major bottleneck of distributed multi-master relational databases is distributed transactions/locks.
- There's a very big difference between storage of relational/row-oriented databases and column-oriented databases. For example, if I have a table of 'users' and I need to store friendships between these users... In a relational database my design is something like: Table: users(pkey = userid) Table: friendships(userid,friendid,...) which contains one (or maybe two depending on how it's impelemented) row for each friendship. In order to lookup a given users friend, SELECT * FROM friendships WHERE userid = 'myid'; This query would use an index on the friendships table to retrieve all the necessary rows. Depending on the relational database you might also be fetching each and every row (entirely) off of disk to be read. In a sharded relational database, this would require hitting every node to get whichever friendships were stored on that node. There's lots of room for optimizations here but any way you slice it, you're likely pulling non-sequential blocks off disk. When you add in the overhead of ACID transactions this can get slow. The cost of this relational query continues to increase as a user adds more friends. You also begin to have practical limits. If I have millions of users, each with many thousands of potential friends, the size of these indexes grow exponentially and things get nasty quickly. Rather than friendships, imagine I'm storing activity logs of actions taken by users. In a column-oriented database these things scale continuously with minimal difference between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships. Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. As column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in the shipping of this information across the network which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendship table would grow massively and the indexes would be out of control. It's certainly possible to make relational databases "scale". What that is about is usually massive optimizations, manual sharding, being very clever about how you query things, and often de-normalizing. Index bloat and table bloat can thrash a relational db. In HBase, de-normalizing is usually a good thing. Storage space is often considered free (not a large cost associated with storing something in multiple places). In a relational database this is not so much the case. In HBase, the sharding and distribution come for free out of the box. With a relational database you must jump through hoops and often end up implementing your own sharding/distributed hashing algorithms so you can distribute across machines. Yes, you lose the relational primitives. Sometimes you wish you could do a simple join. But if you get in the right mindset, you learn how to put your data into this new data model. And in the end the payoffs are huge. In addition to all this, with HBase/Hadoop you get MapReduce. It's feasible to implement something like that on top of a distributed relational database but again the complexity is enormous. With HBase/Hadoop it's a built-in part of the system, a system which is very intelligent at keeping logic close to data, etc. To directly answer your question, it's "simpler" to scale HBase because the scaling comes for free out of the box. You get automatic sharding/distribution of storage and queries. There's nothing simpler than that. Distributing a relational database is never simple. I hope this starts to shed some light on what the differences are. Jonathan Gray Streamy Inc. -----Original Message----- From: Mork0075 [mailto:[EMAIL PROTECTED] Sent: Thursday, August 21, 2008 8:48 AM To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org Subject: Re: Why is scaling HBase much simpler then scaling a relational db? Thank you, but i still don't got it. I've read tons of websites and papers, but there's no clear und founded answer "why use BigTable instead of relational databases". MySQL Cluster seams to offer the same scalabilty and level of abstraction, whithout switching to a non relational pardigm. Lots of blog posts are highly emotional, without answering the core question: "Why RDBMS don't scale and why something like BigTable do". Often you read something like this: "They have also built a system called BigTable, which is a Column Oriented Database, which splits a table into columns rather than rows making is much simpler to distribute and parallelize." Why? Really confusing ... ;) Stuart Sierra schrieb: > On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote: >> Can you please explain, why someone should use HBase for horizontal >> scaling instead of a relational database? One reason for me would be, >> that i don't have to implement the sharding logic myself. Are there other? > > A slight tangent -- there are various tools that implement sharding > over relational databases like MySQL. Two that I know of are > DBSlayer, > http://code.nytimes.com/projects/dbslayer > and MySQL Proxy, > http://forge.mysql.com/wiki/MySQL_Proxy > > I don't know of any formal comparisons between sharding traditional > database servers and distributed databases like HBase. > -Stuart >