RE: implementing join on two Hbase tables

Jonathan Gray Fri, 05 Dec 2008 10:34:38 -0800

I'm not aware of anything that is completely equipped for the task; however,
this could be done more simply with one of the Hadoop MapReduce tools.


My personal favorite is Cascading (http://www.cascading.org) by Chris
Wensel.  It can help with something like reading two different tables in
two separate Map steps and bringing the results together.
Unfortunately, there is not yet an HBase Tap.  If you're interested in
developing one, I have been told that it should not be difficult.  Check out
#cascading on freenode and you should be able to get some help.  If you go
down that route, please let me know because I'm interested in an HBase Tap
as well but have not had the time to work on it.

Hive and Pig are other projects that help with this, but they also do not
have HBase hooks yet (that I'm aware of).

You might also consider something like Pigi (http://www.pigi-project.org),
which is an ORM.  It supports indexing and searching, but I'm unsure whether
any mechanisms for joins are available or planned.

Otherwise, you'll need to write your own jobs.  You would probably need three
different MR jobs: two that each Map over one of the HBase tables you're
interested in, and a third that reads the combined output of those two jobs
and performs the join.  You might be able to use the sort step between Map
and Reduce to perform the join, depending on the details of what you want to
do.  If you go down this path, you can certainly get plenty of help from this
list or the IRC channel #hbase, as this would be very useful to the
community.
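
To make the idea of joining in the sort step concrete, here is a minimal
in-memory sketch of a reduce-side join.  It is not Hadoop or HBase API code:
the class name, the {joinKey, value} row shape, and the "A"/"B" source tags
are all assumptions made for illustration, and a TreeMap stands in for the
framework's sort-and-group step.  A real job would emit the tagged values
from its Mappers and do the pairing in its Reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of a reduce-side join; not Hadoop or HBase API code.
class ReduceSideJoinSketch {

    // A map-output value tagged with the table it came from.
    static final class Tagged {
        final String source;
        final String value;

        Tagged(String source, String value) {
            this.source = source;
            this.value = value;
        }
    }

    // "Map" step: emit (joinKey, taggedValue) for each row of one table.
    // Each row is a hypothetical {joinKey, value} pair.
    static void mapTable(String source, List<String[]> rows,
                         Map<String, List<Tagged>> shuffle) {
        for (String[] row : rows) {
            shuffle.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(new Tagged(source, row[1]));
        }
    }

    // "Reduce" step: for one key, pair every "A" value with every "B" value.
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source.equals("A")) {
                left.add(t.value);
            } else {
                right.add(t.value);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(key + "\t" + l + "\t" + r);
            }
        }
        return joined;
    }

    // Runs both map steps, then reduces each key in sorted order.
    // The TreeMap stands in for the framework's sort-and-group step.
    static List<String> join(List<String[]> tableA, List<String[]> tableB) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        mapTable("A", tableA, shuffle);
        mapTable("B", tableB, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

Calling join(tableA, tableB) returns one tab-separated line per matching pair
of rows, so this is an inner join; keys that appear in only one table produce
no output.  Buffering both sides per key is the simplest approach but assumes
the values for a single key fit in memory.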

JG


> -----Original Message-----
> From: abhinit [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 05, 2008 2:32 AM
> To: [email protected]
> Subject: implementing join on two Hbase tables
> 
> I am trying to implement hash-join and nested join on two Hbase tables.
> However, I am stuck.
> 
> I came across the package *org.apache.hadoop.mapred.join* which joins
> two sorted datasets before map. However, I want to implement joins
> using map/reduce methods so that I have more control on how to join
> the data.
> 
> I found the package *org.apache.hadoop.contrib.utils.join* after a bit
> of searching, which has something I am looking for (not too sure as I
> have not read the code completely).
> It would be great if someone who has used this package can give me a
> pointer on my problem.
> 
> Is there a way I can take two tables as input in TableMap's map
> method? (my guess is no)
> If not, does the current hadoop/hbase implementation provide features
> for implementing user-defined joins?
> 
> Thanks a lot
> -Abhinit
