I'm still really interested in these three questions :)

> Your example with the friends makes perfect sense. Can you imagine a
> scenario where storing the data in a column-oriented instead of a
> row-oriented db (a counterexample, if you will) causes a performance
> mismatch as huge as the friends one in the row/column comparison?
>
> Can you please provide an example of "good de-normalization" in HBase
> and how it's kept consistent (in your friends example in a relational
> db, there would be a cascading delete)? Thinking of the users table:
> if I delete a user with userid='123', do I then have to walk through
> the "friends" column family of all the other users to guarantee
> consistency?! Is de-normalization in HBase only used to avoid joins?
> Our webapp doesn't use joins at the moment anyway.
>
> As you describe it, it's a problem of implementation. BigTable is
> designed to scale; there are routines to shard the data and distribute
> it to the pool of connected servers. Could MySQL perhaps decide
> tomorrow to implement something similar, or does the relational model
> prevent this?
>
> Jonathan Gray wrote:
>> A few very big differences...
>>
>> - HBase/BigTable don't have "transactions" in the same way that a
>> relational database does. While it is possible (and was just recently
>> implemented for HBase, see HBASE-669), it is not at the core of this
>> design. A major bottleneck of distributed multi-master relational
>> databases is distributed transactions/locks.
>>
>> - There's a very big difference between the storage of
>> relational/row-oriented databases and column-oriented databases. For
>> example, if I have a table of 'users' and I need to store friendships
>> between these users... In a relational database my design is something
>> like:
>>
>> Table: users(pkey = userid)
>> Table: friendships(userid,friendid,...) which contains one (or maybe
>> two depending on how it's implemented) row for each friendship.
>>
>> In order to look up a given user's friends, SELECT * FROM friendships
>> WHERE userid = 'myid';
>>
>> This query would use an index on the friendships table to retrieve all
>> the necessary rows. Depending on the relational database you might
>> also be fetching each and every row (entirely) off of disk to be
>> read.
>> In a sharded relational database, this would require hitting
>> every node to get whichever friendships were stored on that node.
>> There's lots of room for optimizations here but any way you slice it,
>> you're likely pulling non-sequential blocks off disk. When you add in
>> the overhead of ACID transactions this can get slow.
>>
>> The cost of this relational query continues to increase as a user adds
>> more friends. You also begin to have practical limits. If I have
>> millions of users, each with many thousands of potential friends, the
>> size of these indexes grows rapidly and things get nasty
>> quickly. Rather than friendships, imagine I'm storing activity logs
>> of actions taken by users.
>>
>> In a column-oriented database these things scale continuously with
>> minimal difference between 10 users and 10,000,000 users, 10
>> friendships and 10,000 friendships.
>>
>> Rather than a friendships table, you could just have a friendships
>> column family in the users table. Each column in that family would
>> contain the ID of a friend. The value could store anything else you
>> would have stored in the friendships table in the relational model.
>> As column families are stored together/sequentially on a per-row
>> basis, reading a user with 1 friend versus a user with 10,000 friends
>> is virtually the same. The biggest difference is just in the shipping
>> of this information across the network, which is unavoidable. In this
>> system a user could have 10,000,000 friends. In a relational database
>> the size of the friendship table would grow massively and the indexes
>> would be out of control.
>>
>> It's certainly possible to make relational databases "scale". What
>> that is about is usually massive optimizations, manual sharding, being
>> very clever about how you query things, and often de-normalizing.
>> Index bloat and table bloat can thrash a relational db.
>>
>> In HBase, de-normalizing is usually a good thing.
>> Storage space is
>> often considered free (there is no large cost associated with storing
>> something in multiple places). In a relational database this is not
>> so much the case.
>>
>> In HBase, the sharding and distribution come for free out of the box.
>> With a relational database you must jump through hoops and often end
>> up implementing your own sharding/distributed hashing algorithms so
>> you can distribute across machines.
>>
>> Yes, you lose the relational primitives. Sometimes you wish you could
>> do a simple join. But if you get in the right mindset, you learn how
>> to put your data into this new data model. And in the end the payoffs
>> are huge.
>>
>> In addition to all this, with HBase/Hadoop you get MapReduce. It's
>> feasible to implement something like that on top of a distributed
>> relational database but again the complexity is enormous. With
>> HBase/Hadoop it's a built-in part of the system, a system which is
>> very intelligent at keeping logic close to data, etc.
>>
>> To directly answer your question, it's "simpler" to scale HBase
>> because the scaling comes for free out of the box. You get automatic
>> sharding/distribution of storage and queries. There's nothing simpler
>> than that. Distributing a relational database is never simple.
>>
>> I hope this starts to shed some light on what the differences are.
>>
>> Jonathan Gray
>> Streamy Inc.
>>
>> -----Original Message-----
>> From: Mork0075 [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 21, 2008 8:48 AM
>> To: [EMAIL PROTECTED]; hbase-user@hadoop.apache.org
>> Subject: Re: Why is scaling HBase much simpler then scaling a
>> relational db?
>>
>> Thank you, but I still don't get it.
>>
>> I've read tons of websites and papers, but there's no clear and
>> well-founded answer to "why use BigTable instead of relational
>> databases".
>>
>> MySQL Cluster seems to offer the same scalability and level of
>> abstraction, without switching to a non-relational paradigm.
>> Lots of
>> blog posts are highly emotional, without answering the core question:
>>
>> "Why don't RDBMSs scale, and why does something like BigTable?" Often
>> you read something like this:
>>
>> "They have also built a system called BigTable, which is a Column
>> Oriented Database, which splits a table into columns rather than rows
>> making it much simpler to distribute and parallelize."
>>
>> Why?
>>
>> Really confusing ... ;)
>>
>> Stuart Sierra wrote:
>>> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]>
>>> wrote:
>>>> Can you please explain why someone should use HBase for horizontal
>>>> scaling instead of a relational database? One reason for me would be
>>>> that I don't have to implement the sharding logic myself. Are there
>>>> others?
>>> A slight tangent -- there are various tools that implement sharding
>>> over relational databases like MySQL. Two that I know of are
>>> DBSlayer,
>>> http://code.nytimes.com/projects/dbslayer
>>> and MySQL Proxy,
>>> http://forge.mysql.com/wiki/MySQL_Proxy
>>>
>>> I don't know of any formal comparisons between sharding traditional
>>> database servers and distributed databases like HBase.
>>> -Stuart
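
For readers following the thread, the two friendship layouts described in Jonathan Gray's message can be sketched roughly as below. This is only an illustration in plain Python: the dicts and lists stand in for the stores, nothing here calls a real HBase or MySQL API, and the user names, dates, and function names are hypothetical sample values that do not appear in the thread.

```python
from collections import defaultdict

# Relational layout: a separate friendships table, one row per friendship.
# (userid, friendid, since) -- sample data, invented for this sketch.
friendships_table = [
    ("alice", "bob", "2008-08-01"),
    ("alice", "carol", "2008-08-03"),
    ("bob", "alice", "2008-08-01"),
]

# Stand-in for the index on userid that the SELECT in the thread relies on:
# it maps a userid to the positions of its rows in the table.
friendships_index = defaultdict(list)
for row_number, (userid, _friendid, _since) in enumerate(friendships_table):
    friendships_index[userid].append(row_number)


def friends_relational(userid):
    """Follow the index, then fetch every matching row one by one."""
    return sorted(friendships_table[n][1] for n in friendships_index[userid])


# Wide-row layout: a "friendships" column family inside the users table.
# Each column name is a friend's ID; the value holds whatever the
# relational friendships row would have carried.
users_table = {
    "alice": {"friendships": {"bob": "2008-08-01", "carol": "2008-08-03"}},
    "bob": {"friendships": {"alice": "2008-08-01"}},
    "carol": {"friendships": {}},
}


def friends_wide_row(userid):
    """Read a single row; the whole column family comes back together."""
    return sorted(users_table[userid]["friendships"])


print(friends_relational("alice"))
print(friends_wide_row("alice"))
```

Both functions return the same friends; the difference the thread points at is that the first touches one table row per friendship via the index, while the second reads a single row keyed by the user.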