[Hadoop Wiki] Trivial Update of "Hbase/FAQ" by Vaibhav Puranik

Apache Wiki Tue, 31 Mar 2009 09:42:09 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by Vaibhav Puranik:
http://wiki.apache.org/hadoop/Hbase/FAQ

------------------------------------------------------------------------------
  
  Rather than a friendships table, you could just have a friendships column 
family in the users table. Each column in that family would contain the ID of a 
friend. The value could store anything else you would have stored in the 
friendships table in the relational model. As column families are stored 
together/sequentially on a per-row basis, reading a user with 1 friend versus a 
user with 10,000 friends is virtually the same. The biggest difference is just 
in the shipping of this information across the network which is unavoidable. In 
this system a user could have 10,000,000 friends. In a relational database the 
size of the friendship table would grow massively and the indexes would be out 
of control.
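
As a rough illustration of the layout described above, here is a minimal sketch that models the users table in memory. It is not the HBase client API: the nested dictionaries (row key, then column family, then column qualifier) and the helper names `add_friend` and `get_friends` are assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HBase table:
# row key -> column family -> column qualifier -> value.
users = defaultdict(lambda: defaultdict(dict))

def add_friend(user_id, friend_id, info=""):
    """Store the friend's ID as a column in the 'friends' family.

    The value holds whatever extra data a relational friendships
    table would have kept for this link.
    """
    users[user_id]["friends"][friend_id] = info

def get_friends(user_id):
    """Read one row; the cost is one lookup whether there is 1 friend or 10,000."""
    return dict(users[user_id]["friends"])

add_friend("123", "456", "since=2008")
add_friend("123", "789", "since=2009")
print(get_friends("123"))
```

The point of the sketch is that all friends of a user live in that user's row, so no separate link table or index is consulted when reading them.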
  
- '''Q: Can you please provide an example of "good de-normalization" in HBase 
and how its held consitent (in your friends example in a relational db, there 
would be a cascadingDelete)? As i think of the users table: if i delete an user 
with the userid='123', then if have to walk through all of the other users 
column-family "friends" to guranty consitency?! Is de-normalization in HBase 
only used to avoid joins? Our webapp doenst use joins at the moment anyway.'''
+ '''Q: Can you please provide an example of "good de-normalization" in HBase 
and how its held consistent (in your friends example in a relational db, there 
would be a cascadingDelete)? As i think of the users table: if i delete an user 
with the userid='123', then if have to walk through all of the other users 
column-family "friends" to guaranty consistency?! Is de-normalization in HBase 
only used to avoid joins? Our webapp doenst use joins at the moment anyway.'''
  
  You lose any concept of foreign keys. You have a primary key, that's it. No
  secondary keys/indexes, no foreign keys.
  
- Another example of "good denormalization" would be something like storing a 
users "favorite pages". If we want to query this data in two ways: for a given 
user, all of his favorites. Or, for a given favorite, all of the users who have
+ It's the responsibility of your application to handle something like deleting 
a friend and cascading to the friendships. Again, typical small web apps are 
far simpler to write using SQL, you become responsible for some of the things 
that were once handled for you.
+ 
- it as a favorite. Relational database would probably have tables for users, 
favorites, and userfavorites. Each link would be stored in one row in the 
userfavorites table. We would have indexes on both 'userid' and 'favoriteid' 
and could thus query it in both ways described above. In HBase we'd probably 
put a column in both the users table and the favorites table, there would be no 
link table.
+ Another example of "good denormalization" would be something like storing a 
users "favorite pages". If we want to query this data in two ways: for a given 
user, all of his favorites. Or, for a given favorite, all of the users who have 
it as a favorite. Relational database would probably have tables for users, 
favorites, and userfavorites. Each link would be stored in one row in the 
userfavorites table. We would have indexes on both 'userid' and 'favoriteid' 
and could thus query it in both ways described above. In HBase we'd probably 
put a column in both the users table and the favorites table, there would be no 
link table.
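
A minimal sketch of this two-table layout, again using plain dictionaries rather than the HBase client API. The table shapes, the column family names `favorites` and `users`, and all function names are assumptions for illustration. It also shows the application doing the cleanup that a cascading delete would have done in a relational database.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two HBase tables:
# row key -> column family -> column qualifier -> value.
users = defaultdict(lambda: defaultdict(dict))
favorites = defaultdict(lambda: defaultdict(dict))

def add_favorite(user_id, page_id):
    """Write the link twice, once per table, instead of once in a link table."""
    users[user_id]["favorites"][page_id] = ""
    favorites[page_id]["users"][user_id] = ""

def favorites_of(user_id):
    """For a given user, all of his favorites (one row read)."""
    return sorted(users[user_id]["favorites"])

def users_of(page_id):
    """For a given favorite, all users who have it (one row read)."""
    return sorted(favorites[page_id]["users"])

def delete_user(user_id):
    """Application-level 'cascading delete': remove the user row, then
    remove the user's column from every favorite row that referenced it."""
    row = users.pop(user_id, {})
    for page_id in row.get("favorites", {}):
        favorites[page_id]["users"].pop(user_id, None)

add_favorite("alice", "page-1")
add_favorite("bob", "page-1")
delete_user("alice")
print(users_of("page-1"))
```

Because the user row already lists every favorite that points back at it, the delete only touches those rows and does not need to scan the whole favorites table.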
  
  That would be a very efficient query in both architectures, with relational 
performing much better with small datasets but less so with a large 
dataset.
  
  Now consider asking for the favorites of these 10 users. That starts to get tricky in 
HBase and will undoubtedly suffer more from random reads. The flexibility of 
SQL allows us to just ask the database for the answer to that question. In a
- small dataset it will come up with a decent solution, and return the results 
to you in a matter of milliseconds. Now let's make that userfavorites table a 
few billion rows, and the number of users you're asking for a couple thousand. 
The query planner will come up with something but things will fall down and it 
will end up taking forever. The worst problem will be in the index bloat. 
Insertions to this link table will start to take a very long time. HBase will
+ small dataset it will come up with a decent solution, and return the results 
to you in a matter of milliseconds. Now let's make that userfavorites table a 
few billion rows, and the number of users you're asking for a couple thousand. 
The query planner will come up with something but things will fall down and it 
will end up taking forever. The worst problem will be in the index bloat. 
Insertions to this link table will start to take a very long time. HBase will 
perform virtually the same as it did on the small table, if not better because 
of superior region distribution.
- perform virtually the same as it did on the small table, if not better 
because of superior region distribution.
  
  '''Q:[Michael Dagaev] How would you design an Hbase table for many-to-many 
association between two entities, for example Student and Course?'''
