Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hbase/FAQ_General" page has been changed by DougMeil:
http://wiki.apache.org/hadoop/Hbase/FAQ_General?action=diff&rev1=1&rev2=2

- Describe Hbase/FAQ_General here.
+ FAQ - General Questions
  
+ == Questions ==
+  1. [[#1|When would I use HBase?]]
+  1. [[#2|Can someone give an example of basic API-usage going against hbase?]]
+  1. [[#3|What other hbase-like applications are there out there?]]
+  1. [[#8|How do I access HBase from my Ruby/Python/Perl/PHP/etc. 
application?]]
+  1. [[#14|Can HBase development be done on windows?]]
+  1. [[#15|Please explain HBase version numbering?]]
+  1. [[#16|What version of Hadoop do I need to run HBase?]]
+  1. [[#18|Are there any schema design examples?]]
+   
+ == Answers ==
+ 
+ 
+ '''1. <<Anchor(1)>> When would I use HBase?'''
+ 
+ See [[http://blog.rapleaf.com/dev/?p=26|Bryan Duxbury's post]] on this topic.
+ 
+ 
+ '''2. <<Anchor(2)>> Can someone give an example of basic API-usage going 
against hbase?'''
+ 
+ See the Data Model section in the HBase Book:  
http://hbase.apache.org/book.html#datamodel
+ 
+ See the [[Hbase|wiki home page]] for sample code accessing HBase from 
languages other than Java.
+ 
+ '''3. <<Anchor(3)>> What other hbase-like applications are there out there?'''
+ 
+ Broadly speaking, there are many.  One place to start your search is 
[[http://blog.oskarsson.nu/2009/06/nosql-debrief.html|nosql]].
+ 
+ '''8. <<Anchor(8)>> How do I access HBase from my Ruby/Python/Perl/PHP/etc. 
application?'''
+ 
+ See the non-Java access section on the [[Hbase|HBase wiki home page]].
+ 
+ 
+ '''14. <<Anchor(14)>> Can HBase development be done on windows?'''
+ 
+ See the Getting Started section in the HBase Book:  
http://hbase.apache.org/book.html#getting_started
+ 
+ '''15. <<Anchor(15)>> Please explain HBase version numbering?'''
+ 
+ See [[http://wiki.apache.org/hadoop/Hbase/HBaseVersions|HBase Versions since 
0.20.x]].  The below is left in place for the historians.
+ 
+ Originally HBase lived under src/contrib in Hadoop Core.  The HBase version 
was that of the hosting Hadoop.  The last HBase version that bundled under 
contrib was part of Hadoop 0.16.1 (March of 2008).
+ 
+ The first HBase Hadoop subproject release was versioned 0.1.0.  Subsequent 
releases went at least as far as 0.2.1 (September 2008).
+ 
+ In August of 2008, consensus had it that since HBase depends on a particular 
Hadoop Core version, the HBase major+minor versions would from then on mirror 
those of the Hadoop Core version HBase depends on.  The first HBase release to 
take on this new versioning regimen was HBase 0.18.0; HBase 0.18.0 depends on 
Hadoop 0.18.x.
+ 
+ Sorry for any confusion caused.
+ 
+ '''16. <<Anchor(16)>> What version of Hadoop do I need to run HBase?'''
+ 
+ Different versions of HBase require different versions of Hadoop.  Consult 
the table below to find which version of Hadoop you will need:
+ 
+ ||'''HBase Release Number'''||'''Hadoop Release Number'''||
+ ||0.1.x||0.16.x||
+ ||0.2.x||0.17.x||
+ ||0.18.x||0.18.x||
+ ||0.19.x||0.19.x||
+ ||0.20.x||0.20.x||
+ 
+ Releases of Hadoop can be found 
[[http://hadoop.apache.org/core/releases.html|here]].  We recommend using the 
most recent version of Hadoop possible, as it will contain the most bug fixes.
+ 
+ Note that HBase-0.2.x can be made to work on Hadoop-0.18.x.  HBase-0.2.x 
ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile 
Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with 
the jars from Hadoop-0.18.x.
+ 
+ Also note that after HBase-0.2.x, the HBase release numbering schema will 
change to align with the Hadoop release number on which it depends.
+ 
+ 
+ '''18. <<Anchor(18)>> Are there any Schema Design examples?'''
+ 
+ See 
[[http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies|HBase
 Schema Design -- Case Studies]] by Evan(Qingyan) Liu or the following text 
taken from Jonathan Gray's mailing list posts.
+ 
+ There's a very big difference between storage in relational/row-oriented 
databases and column-oriented databases. For example, suppose I have a table of 
'users' and I need to store friendships between these users. In a relational 
database my design is something like:
+ 
+ Table: users(pkey = userid)
+ Table: friendships(userid, friendid, ...), which contains one (or maybe two, 
depending on how it's implemented) rows for each friendship.
+ 
+ In order to look up a given user's friends: SELECT * FROM friendships WHERE 
userid = 'myid';
+ 
+ The cost of this relational query continues to increase as a user adds more 
friends. You also begin to hit practical limits. If I have millions of users, 
each with many thousands of potential friends, the size of these indexes grows 
exponentially and things get nasty quickly. Rather than friendships, imagine 
I'm storing activity logs of actions taken by users.
+ 
+ In a column-oriented database these things scale continuously with minimal 
difference between 10 users and 10,000,000 users, 10 friendships and 10,000 
friendships.
+ 
+ Rather than a friendships table, you could just have a friendships column 
family in the users table. Each column in that family would contain the ID of a 
friend. The value could store anything else you would have stored in the 
friendships table in the relational model. As column families are stored 
together/sequentially on a per-row basis, reading a user with 1 friend versus a 
user with 10,000 friends is virtually the same. The biggest difference is just 
in the shipping of this information across the network which is unavoidable. In 
this system a user could have 10,000,000 friends. In a relational database the 
size of the friendship table would grow massively and the indexes would be out 
of control.
+ 
+ '''Q: Can you please provide an example of "good de-normalization" in HBase 
and how it's kept consistent? (In your friends example, a relational db would 
have a cascading delete.) As I think of the users table: if I delete a user 
with userid='123', do I have to walk through every other user's "friends" 
column family to guarantee consistency? Is de-normalization in HBase only used 
to avoid joins? Our webapp doesn't use joins at the moment anyway.'''
+ 
+ You lose any concept of foreign keys. You have a primary key, that's it. No
+ secondary keys/indexes, no foreign keys.
+ 
+ It's the responsibility of your application to handle something like deleting 
a friend and cascading the change to the friendships. Again, typical small web 
apps are far simpler to write using SQL; here you become responsible for some 
of the things that were once handled for you.
+ 
+ Another example of "good denormalization" would be something like storing a 
user's "favorite pages". We want to query this data in two ways: for a given 
user, all of his favorites; or, for a given favorite, all of the users who have 
it as a favorite. A relational database would probably have tables for users, 
favorites, and userfavorites. Each link would be stored in one row of the 
userfavorites table. We would have indexes on both 'userid' and 'favoriteid' 
and could thus query it in both ways described above. In HBase we'd instead 
put a column in both the users table and the favorites table; there would be no 
link table.
+ 
+ That would be a very efficient query in both architectures, with the 
relational database performing much better on small datasets but less so on a 
large dataset.
+ 
+ Now ask for the favorites of 10 of those users. That starts to get tricky in 
HBase and will undoubtedly suffer worse from random reading. The flexibility of 
SQL allows us to just ask the database for the answer to that question. On a 
small dataset it will come up with a decent plan and return the results to you 
in a matter of milliseconds. Now let's make that userfavorites table a few 
billion rows, and the number of users you're asking for a couple of thousand. 
The query planner will come up with something, but things will fall down and it 
will end up taking forever. The worst problem will be index bloat: insertions 
into this link table will start to take a very long time. HBase will perform 
virtually the same as it did on the small table, if not better, because of 
superior region distribution.
+ 
+ '''Q: [Michael Dagaev] How would you design an HBase table for a many-to-many 
association between two entities, for example Student and Course?'''
+ 
+ I would define two tables:
+ 
+ Student: student id; student data (name, address, ...); courses (use course 
ids as column qualifiers here)
+ Course: course id; course data (name, syllabus, ...); students (use student 
ids as column qualifiers here)
+ 
+ Does it make sense? 
+ 
+ A: [Jonathan Gray] Your design does make sense.
+ 
+ As you said, you'd probably have two column-families in each of the Student 
and Course tables. One for the data, another with a column per student or 
course.
+ For example, a student row might look like:
+ Student :
+ id/row/key = 1001 
+ data:name = Student Name 
+ data:address = 123 ABC St 
+ courses:2001 = (If you need more information about this association, for 
example, if they are on the waiting list) 
+ courses:2002 = ...
+ 
+ This schema gives you fast access to both queries: all classes for a 
student (student table, courses family), or all students for a class (course 
table, students family).
+ 
