Re: Join in HBase

bharath vissapragada Thu, 16 Jul 2009 20:51:32 -0700

Hi,

Did you see the algorithm for the map-reduce join I implemented? (I wrote it
at the end of my previous mail.) Any comments on it, such as possible
improvements?


On Thu, Jul 16, 2009 at 11:12 PM, Jonathan Gray <[email protected]> wrote:

> Hoberto, Bharath,
>
> Designing these kinds of queries efficiently in HBase means doing multiple
> round-trips, or denormalizing.
>
> That means degrading performance as the query complexity increases, or lots
> of data duplication and a complex write/update process.
>
> In your audit example, you provide the denormalizing solution.  Store the
> fields you need with the data you are querying (details of the user/action
> in the audit table with the audit).  If you have to update those details,
> then you have an extra expense on your write (and you introduce a potential
> synchronization issue without transactions).
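
A minimal sketch of that write-side cost, using plain Python dicts as
stand-ins for the tables. The row keys, the sample values, and the rename_user
helper are made up for illustration; none of this is HBase API:

```python
# In-memory stand-ins for HBase tables: row key -> {column: value}.
users = {"u1": {"firstname": "Ann", "lastname": "Lee"}}
audits = {
    "a1": {"userid": "u1", "firstname": "Ann", "lastname": "Lee", "actionname": "login"},
    "a2": {"userid": "u1", "firstname": "Ann", "lastname": "Lee", "actionname": "logout"},
    "a3": {"userid": "u2", "firstname": "Bob", "lastname": "Ray", "actionname": "login"},
}


def rename_user(users, audits, userid, firstname, lastname):
    """Update the user row plus every audit row that copies the name.

    Returns the number of rows written.
    """
    new_name = {"firstname": firstname, "lastname": lastname}
    users[userid].update(new_name)
    writes = 1
    # Without a secondary index on userid this is a full scan of audits.
    for row in audits.values():
        if row["userid"] == userid:
            # Until this loop finishes, readers can see old and new names mixed.
            row.update(new_name)
            writes += 1
    return writes


rows_written = rename_user(users, audits, "u1", "Anne", "Lee")
print(rows_written)  # 1 user row + 2 audit rows
```

The write count grows with the number of audits a user has, which is the
extra expense described above; the window in which the loop is still running
is where the synchronization issue would show up.
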
>
> The choice about how to solve this really depends on the use case and what
> your requirements are.  Can you ever afford to miss an update in one of the
> denormalized fields, even if it is extremely unlikely?  You can build
> transactional layers on top or you can take a look at TransactionalHBase
> which attempts to do this in a more integrated way.
>
> You also talk about the other approach, running multiple queries in the
> application.  As far as memory pressure in the app is concerned, that would
> really depend on the nature of the join.  It's more an issue of how many
> joins you need to make, and if there's any way to reduce the number of
> queries/trips needed.
>
> If I am pulling the most recent 10 audits, and I need to join each with
> both the User and Actions table, then we're talking about 1 + 10 + 10 total
> queries.  That's not so pretty, but if done in a distributed or threaded way
> may not be too bad.  In the future, I expect more and more tools/frameworks
> available to aid in that process.
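
A rough sketch of the 1 + 10 + 10 pattern done in a threaded way. The
CountingTable class is a hypothetical in-memory stand-in that only counts
round-trips, and the row keys and sample data are assumptions; a real client
would issue gets and a scan against HBase instead:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingTable:
    """In-memory stand-in for a table that counts round-trips."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._lock = threading.Lock()
        self.trips = 0

    def _count(self):
        with self._lock:
            self.trips += 1

    def get(self, key):
        self._count()
        return self._rows[key]

    def latest(self, n):
        # Row keys are zero-padded timestamps, so the largest keys are newest.
        self._count()
        return [self._rows[k] for k in sorted(self._rows, reverse=True)[:n]]


def join_recent_audits(audits, users, actions, n=10, workers=8):
    recent = audits.latest(n)  # 1 query

    def enrich(audit):
        user = users.get(audit["userid"])        # 1 query per audit
        action = actions.get(audit["actionid"])  # 1 query per audit
        return {"time": audit["time"], "user": user["lastname"],
                "action": action["actionname"]}

    # pool.map keeps the results in the same order as `recent`.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich, recent))


users = CountingTable({f"u{i}": {"lastname": f"User{i}"} for i in range(3)})
actions = CountingTable({"x1": {"actionname": "login"}, "x2": {"actionname": "logout"}})
audits = CountingTable({
    f"{t:04d}": {"time": t, "userid": f"u{t % 3}", "actionid": f"x{t % 2 + 1}"}
    for t in range(25)
})

joined = join_recent_audits(audits, users, actions)
total_trips = audits.trips + users.trips + actions.trips
print(len(joined), total_trips)
```

The number of round-trips is unchanged by threading; what the pool buys is
that the 20 lookups overlap instead of running back to back.
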
>
> Today, this is up to you.
>
> At Streamy, we solve these problems with layers above HBase.  Some of them
> keep lots of stuff in-memory and do complex joins in-memory. Others
> coordinate the multiple queries to HBase, with or without an OCC-style
> transaction.
>
> My suggestion is to start denormalized.  Build naive queries that do lots
> of round-trips.  See what the performance is like under different conditions
> and then go from there.  Perhaps Actions are generally immutable, their name
> never changes, so you could denormalize that field and cut out half of the
> total queries.  Have a pool of threads that grab Users so you can do the
> join in parallel.  Depending on your requirements, this might be sufficient.
> Otherwise, look at more denormalization, or at building a thin layer above.
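
One way the "denormalize the action name, fetch Users from a pool" idea could
look. This is a sketch under assumptions: join_users_only and fetch_user are
illustrative names, the sample rows are made up, and fetch_user stands in for
whatever single-row lookup the application really performs:

```python
from concurrent.futures import ThreadPoolExecutor


def join_users_only(recent_audits, fetch_user, workers=4):
    """Join audits that already carry actionname; only User needs a lookup."""

    def enrich(audit):
        user = fetch_user(audit["userid"])
        return {**audit, "firstname": user["firstname"], "lastname": user["lastname"]}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich, recent_audits))


# Made-up sample data: 10 audits with actionname stored inline.
user_rows = {"u0": {"firstname": "Ann", "lastname": "Lee"},
             "u1": {"firstname": "Bob", "lastname": "Ray"}}
recent_audits = [{"audittime": t, "userid": f"u{t % 2}", "actionname": "login"}
                 for t in range(10)]

lookups = []  # one entry per User round-trip


def fetch_user(userid):
    lookups.append(userid)  # list.append is thread-safe in CPython
    return user_rows[userid]


joined = join_users_only(recent_audits, fetch_user)
print(len(joined), len(lookups))
```

With the action name stored on the audit row, the per-audit lookups drop from
two to one, so the join costs 1 + 10 queries instead of 1 + 10 + 10.
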
>
> JG
>
>
> Mr Hoberto wrote:
>
>> I can think of two cases that I've been wondering about (I am very new and
>> am still reading the docs & archives, so I apologize if this has already
>> been covered or if I use the wrong notation... I'm still learning).
>>
>> First case:
>>
>> Tracking audits. In the RDBMS world you'd have the following schema:
>>
>> User (userid, firstname, lastname)
>> Actions (actionid, actionname)
>> Audit (auditTime, userid, actionid)
>>
>> I think the answer in the HBase world is to denormalize the data and have a
>> structure such as:
>>
>> audits (auditid, audittime[timestamp],
>>         whowhat[family (firstname, lastname, actionname)])
>>
>> The problem, as Bharath says, is what happens if a firstname or lastname
>> needs to be updated. Running a correction on all those denormalized rows is
>> going to be problematic.
>>
>> Alternatively, I suppose you could store the User and Actions tables
>> separately, keep only IDs in the audits structure in HBase, and use the
>> website's application layer to "merge" the different data sets together for
>> display on a page. The downside there is that if you wind up with a
>> significant number of users or actions, it'll put a lot of memory pressure
>> on the app servers.
>>
>> Second case:
>>
>> Doing analysis on two time-series-based data structures, such as a "PE
>> Ratio".
>>
>> In the RDBMS world you'd have two tables:
>>
>> Prices (ticker, date, price)
>> Earnings (ticker, date, earning)
>>
>> Again, I think the answer is denormalizing in the HBase world, with a
>> structure such as:
>>
>> PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
>>
>> The problem here comes, again, with updates. For instance, what if you only
>> have earnings information available on an annual basis, and you've come
>> across a source that has it quarterly? You'll have to update 3/4 of the
>> rows in the denormalized table.
>>
>> Once again, I apologize for any misunderstanding. I'm still learning the
>> concepts behind column stores and map/reduce.
>>
>> -hob
>>
>>
>> On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]>
>> wrote:
>>
>>> The answer is, it depends.
>>>
>>> What are the details of what you are trying to join?  Is it just a simple
>>> 1-to-1 join, or 1-to-many, or what?  At a minimum, the join would require
>>> two round-trips.  However, 0.20 can do simple queries in the 1-10ms
>>> time-range (closer to 1ms when the blocks are already cached).
>>>
>>> The comparison to an RDBMS cannot be made directly because a single-node
>>> RDBMS with a smallish table will be quite fast at simple index-based
>>> joins.  I would guess that unloaded, single-machine performance of this
>>> join operation would be much faster in an RDBMS.
>>>
>>> But if your table has millions or billions of rows, it's a different
>>> situation.  HBase performance will stay nearly constant as your table
>>> increases, as long as you have the nodes to support your dataset and the
>>> load.
>>>
>>> What are your targets for time (sub 100ms? 10ms?), and what are the
>>> details of what you're joining?
>>>
>>> As far as code is concerned, there is not much to a simple join, so I'm
>>> not sure how helpful it would be.  If you give some detail perhaps I can
>>> provide some pseudo-code for you.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> JG, thanks for your reply.
>>>>
>>>> I am trying to implement a realtime join of two tables on HBase. I tried
>>>> denormalizing the tables to avoid the joins, but then updating the data
>>>> becomes really difficult. I understand that the features I am trying to
>>>> implement are those of an RDBMS and that HBase is used for a different
>>>> purpose. Even so, I would like to try storing the data in HBase and
>>>> implementing joins so that I can test the performance; if it is
>>>> effective (at least on a large number of nodes), it may be of some help
>>>> to me. I know some people have already tried this. If any of you have,
>>>> can you tell me how the results were? I mean, are they good compared to
>>>> an RDBMS join on a single machine?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]>
>>>> wrote:
>>>>
>>>>> Bharath,
>>>>>
>>>>> You need to outline what your actual requirements are if you want more
>>>>> help.  Open-ended questions that just ask for code are usually not
>>>>> answered.
>>>>>
>>>>> What exactly are you trying to join?  Does this join need to happen in
>>>>> "realtime" or is this part of a batch process?
>>>>>
>>>>> Could you denormalize your data to prevent needing the join at runtime?
>>>>>
>>>>> If you provide details about exactly what your data/schema is like (or a
>>>>> similar example if this is confidential), then many of us are more than
>>>>> happy to help you figure out what approach might work best.
>>>>>
>>>>> When working with HBase, figuring out how you want to pull your data out
>>>>> is key to how you want to put the data in.
>>>>>
>>>>> JG
>>>>>
>>>>>
>>>>> bharath vissapragada wrote:
>>>>>
>>>>>> Amandeep, can you tell me what kinds of joins you have implemented, and
>>>>>> which works best (based on observation)? Can you show us the source
>>>>>> code (if possible)?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>>>>
>>>>>>> Not tried Cascading yet.
>>>>>>>
>>>>>>> -ak
>>>>>>>
>>>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That's fine. I know that HBase has completely different usage
>>>>>>>> compared to SQL. But in my application there is some kind of
>>>>>>>> dependency among the tables, so I need to implement a join. I wanted
>>>>>>>> to know whether there is some kind of implementation already.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> HBase != SQL.
>>>>>>>>>
>>>>>>>>> You might want map reduce or Cascading.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>>>> vissapragada<[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I want to join two tables in HBase (similar to a relational database
>>>>>>>>>> join). Can anyone tell me whether this is already implemented in the
>>>>>>>>>> source?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Amandeep Khurana
>>>>>>> Computer Science Graduate Student
>>>>>>> University of California, Santa Cruz
