Re: Join in HBase

bharath vissapragada Thu, 16 Jul 2009 09:26:16 -0700

I first wanted to implement a simple 1-1 join and, if it shows some
performance gain, extend it to 1-many by modifying the algorithm. I want to
run this join in a realtime environment with tables of billions of rows and
data that grows day by day. I chose this approach because the major problem
with an RDBMS is scaling the data (I have read this on many websites, and it
is what led to the development of key-value databases).

When I asked you about comparing the performance of an RDBMS and a join in
HBase, I meant the case where the amount of data is huge (billions of rows).
The difference is that with an RDBMS the data resides on a single machine and
the processing is done on that same machine, whereas in the approach I want
to implement the data resides in HBase and the join is performed on different
machines simultaneously (as in map-reduce).

I have already implemented a map-reduce variant of HashJoin on Hadoop using
just HDFS (the data is stored in HDFS files, not in HBase). The details of
that join are as follows.


Consider a simple 1-1 join.

*Map phase:*
-> Read the input files (inner and outer relations), parse them, and emit
the value of the column participating in the join as the key.
-> After the map phase is over, all the values with the same column value go
to the same system. (We can employ simple methods to know which tuple is from
the inner relation and which is from the outer relation.)
*Reduce phase:*
Since the tuples with matching column values are on the same system, they
can be joined and written to disk.
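
The two phases above can be sketched roughly as follows. This is only a
minimal, single-process simulation in plain Python, not Hadoop or HBase code:
the function names, the INNER/OUTER tags (one possible "simple method" for
telling the relations apart), and the assumption that the join column is at
index 0 of each tuple are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical tags marking which relation a tuple came from.
INNER, OUTER = "inner", "outer"

def map_phase(relation, tag, key_index):
    """Emit (join key, (tag, tuple)) for every tuple of one relation."""
    for row in relation:
        yield row[key_index], (tag, row)

def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Join the inner and outer tuples that share one join key."""
    inner = [row for tag, row in values if tag == INNER]
    outer = [row for tag, row in values if tag == OUTER]
    for i in inner:
        for o in outer:
            yield i + o  # concatenate the two tuples into one joined tuple

def hash_join(inner_rel, outer_rel, inner_key=0, outer_key=0):
    """Run map, shuffle and reduce in sequence and collect the joined tuples."""
    pairs = list(map_phase(inner_rel, INNER, inner_key))
    pairs += list(map_phase(outer_rel, OUTER, outer_key))
    result = []
    for values in shuffle(pairs).values():
        result.extend(reduce_phase(values))
    return result
```

For example, calling hash_join([(1, "alice"), (2, "bob")], [(1, "book"),
(3, "pen")]) joins only the tuples with key 1. Because the reducer pairs every
inner tuple with every outer tuple for a key, the same sketch would also cover
the 1-many case.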

Can you tell me what changes I have to make in order to implement a similar
algorithm in HBase (which is column-oriented and differs greatly in I/O
behaviour, since the data of each column family is grouped together)?

Please reply :) and btw sorry for a looooooong mail :)

On Thu, Jul 16, 2009 at 8:49 PM, Jonathan Gray <[email protected]> wrote:

> The answer is, it depends.
>
> What are the details of what you are trying to join?  Is it just a simple
> 1-to-1 join, or 1-to-many or what?  At a minimum, the join would require two
> round-trips.  However, 0.20 can do simple queries in the 1-10ms time-range
> (closer to 1ms when the blocks are already cached).
>
> The comparison to an RDBMS cannot be made directly because a single-node
> RDBMS with a smallish table will be quite fast at simple index-based joins.
>  I would guess that unloaded, single machine performance of this join
> operation would be much faster in an RDBMS.
>
> But if your table has millions or billions of rows, it's a different
> situation.  HBase performance will stay nearly constant as your table
> increases, as long as you have the nodes to support your dataset and the
> load.
>
> What are your targets for time (sub 100ms? 10ms?), and what are the details
> of what you're joining?
>
> As far as code is concerned, there is not much to a simple join, so I'm not
> sure how helpful it would be.  If you give some detail perhaps I can provide
> some pseudo-code for you.
>
> JG
>
>
> bharath vissapragada wrote:
>
>> JG, thanks for your reply.
>>
>> I am trying to implement a realtime join of two tables on HBase. I tried
>> denormalizing the tables to avoid joins, but then updating the data
>> becomes really difficult. I understand that the features I am trying to
>> implement are those of an RDBMS and that HBase is used for a different
>> purpose. Even so, I would like to store the data in HBase and implement
>> joins so that I can test the performance; if it is effective (at least on
>> a large number of nodes), it may be of some help to me. I know some people
>> have already tried this. If any of you have, can you tell me how the
>> results were? I mean, are they good compared to an RDBMS join on a single
>> machine?
>>
>> Thanks
>>
>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
>>
>>  Bharath,
>>>
>>> You need to outline what your actual requirements are if you want more
>>> help.  Open-ended questions that just ask for code are usually not
>>> answered.
>>>
>>> What exactly are you trying to join?  Does this join need to happen in
>>> "realtime" or is this part of a batch process?
>>>
>>> Could you denormalize your data to prevent needing the join at runtime?
>>>
>>> If you provide details about exactly what your data/schema is like (or a
>>> similar example if this is confidential), then many of us are more than
>>> happy to help you figure out what approach may work best.
>>>
>>> When working with HBase, figuring out how you want to pull your data out
>>> is key to how you want to put the data in.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> Amandeep, can you tell me what kinds of joins you have implemented and
>>>> which works best (based on observation)? Can you show us the source code
>>>> (if possible)?
>>>>
>>>> Thanks in advance
>>>>
>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>> Not tried cascading yet.
>>>>>
>>>>> -ak
>>>>>
>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> That's fine. I know that HBase has completely different usage compared
>>>>>> to SQL. But for my application there is some kind of dependency involved
>>>>>> among the tables, so I need to implement a join. I wanted to know
>>>>>> whether there is some kind of implementation already.
>>>>>>
>>>>>> Thanks
>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> HBase != SQL.
>>>>>>> You might want map reduce or cascading.
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>> vissapragada<[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I want to join (similar to a relational database join) two tables in
>>>>>>>> HBase. Can anyone tell me whether it is already implemented in the
>>>>>>>> source?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>>
>>>>>>>>  --
>>>>>
>>>>>
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
>>>>>
>>>>>
>>>>>
>>
