Hoberto, Bharath,

Designing these kinds of queries efficiently in HBase means doing multiple round-trips, or denormalizing.

That means either performance that degrades as query complexity increases, or lots of data duplication and a more complex write/update process.

In your audit example, you describe the denormalized solution: store the fields you need alongside the data you are querying (details of the user/action in the audit table with the audit). If you have to update those details, then you pay an extra expense on every write (and you introduce a potential synchronization issue without transactions).
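
For illustration, a minimal sketch of that denormalized write using the 0.20 Java client. The table, family, and column names here are hypothetical, just mirroring your audit example:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditWriter {
  // Copy the user/action details into the audit row at write time,
  // so the read side needs no join.
  public static void writeAudit(String auditId, String firstName,
      String lastName, String actionName) throws IOException {
    HTable audits = new HTable(new HBaseConfiguration(), "audits");
    byte[] fam = Bytes.toBytes("whowhat");
    Put put = new Put(Bytes.toBytes(auditId));
    put.add(fam, Bytes.toBytes("firstName"), Bytes.toBytes(firstName));
    put.add(fam, Bytes.toBytes("lastName"), Bytes.toBytes(lastName));
    put.add(fam, Bytes.toBytes("actionName"), Bytes.toBytes(actionName));
    audits.put(put);
    audits.close();
  }
}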

The choice of how to solve this really depends on your use case and requirements. Can you ever afford to miss an update in one of the denormalized fields, even if it is extremely unlikely? You can build transactional layers on top, or you can take a look at TransactionalHBase, which attempts to do this in a more integrated way.
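
To make the OCC idea concrete, here is a rough sketch built around a hypothetical "version" column and the client's checkAndPut primitive. This is illustrative only, not how TransactionalHBase works internally:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class OccUpdate {
  // Optimistic update of a denormalized field: only write if nobody
  // else bumped the version between our read and our write.
  public static boolean updateFirstName(HTable users, String userId,
      String newFirstName) throws IOException {
    byte[] row = Bytes.toBytes(userId);
    byte[] fam = Bytes.toBytes("info");
    byte[] verCol = Bytes.toBytes("version");

    Result current = users.get(new Get(row));
    byte[] oldVersion = current.getValue(fam, verCol);

    Put put = new Put(row);
    put.add(fam, Bytes.toBytes("firstName"), Bytes.toBytes(newFirstName));
    put.add(fam, verCol, Bytes.toBytes(Bytes.toLong(oldVersion) + 1));

    // Returns false if the version changed underneath us; caller retries.
    return users.checkAndPut(row, fam, verCol, oldVersion, put);
  }
}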

You also talk about the other approach, running multiple queries in the application. As far as memory pressure in the app is concerned, that would really depend on the nature of the join. It's more an issue of how many joins you need to make, and if there's any way to reduce the number of queries/trips needed.

If I am pulling the most recent 10 audits, and I need to join each with both the User and Actions tables, then we're talking about 1 + 10 + 10 = 21 total queries. That's not so pretty, but done in a distributed or threaded way it may not be too bad. In the future, I expect more and more tools/frameworks to become available to aid in that process.

Today, this is up to you.
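
As a sketch of doing it yourself, here is the threaded approach from the 1 + 10 + 10 example: a pool that does the User gets in parallel. The names are hypothetical, and note that HTable is not thread-safe, so each task opens its own instance:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ParallelUserFetch {
  // Do the per-audit User gets in parallel rather than serially.
  public static List<Result> fetchUsers(final HBaseConfiguration conf,
      List<byte[]> userIds) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    List<Future<Result>> futures = new ArrayList<Future<Result>>();
    for (final byte[] userId : userIds) {
      futures.add(pool.submit(new Callable<Result>() {
        public Result call() throws IOException {
          // HTable is not thread-safe, so each task uses its own.
          HTable users = new HTable(conf, "users");
          try {
            return users.get(new Get(userId));
          } finally {
            users.close();
          }
        }
      }));
    }
    List<Result> results = new ArrayList<Result>();
    for (Future<Result> f : futures) {
      results.add(f.get());  // the application-level "join" happens here
    }
    pool.shutdown();
    return results;
  }
}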

At Streamy, we solve these problems with layers above HBase. Some of them keep lots of data in memory and do complex joins there. Others coordinate the multiple queries to HBase, with or without an OCC-style transaction.

My suggestion is to start denormalized. Build naive queries that do lots of round-trips, see what the performance is like under different conditions, and go from there. Perhaps Actions are generally immutable (their names never change), so you could denormalize that field and cut out half of the total queries. Have a pool of threads that grab Users so you can do the join in parallel. Depending on your requirements, this might be sufficient. Otherwise, look at more denormalization, or build a thin layer above.
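
To make that concrete, a sketch of pulling the most recent 10 audits with actionName already denormalized, assuming a reverse-timestamp row key (Long.MAX_VALUE - auditTime) so a plain scan returns newest first. That key design and the names are my assumptions, not anything your schema implies:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentAudits {
  // Assumes audit row keys are (Long.MAX_VALUE - auditTime), so a
  // plain scan returns the newest audits first.
  public static void printRecent(HBaseConfiguration conf) throws IOException {
    HTable audits = new HTable(conf, "audits");
    byte[] fam = Bytes.toBytes("whowhat");
    Scan scan = new Scan();
    scan.addFamily(fam);
    ResultScanner scanner = audits.getScanner(scan);
    try {
      for (Result audit : scanner.next(10)) {
        // actionName was denormalized at write time, so no Actions
        // lookup; only the User join remains (see the pool above).
        byte[] action = audit.getValue(fam, Bytes.toBytes("actionName"));
        System.out.println(Bytes.toString(action));
      }
    } finally {
      scanner.close();
      audits.close();
    }
  }
}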

JG

Mr Hoberto wrote:
I can think of two cases that I've been wondering about (I am very new, and
am still reading the docs & archives, so I apologize if this has already been
covered or if I use the wrong notation... I'm still learning).

First case:

Tracking audits. In the RDBMS world you'd have the following schema:

User (userid, firstname, lastname)
Actions (actionid, actionname)
Audit (auditTime, userid, actionid)

I think the answer in the HBase world is to denormalize the data... have a
structure such as:

audits (auditid, audittime[timestamp], whowhat[family (firstName, lastname,
actionname)])

The problem, as Bharath says, is what happens if a firstName or lastName needs
to be updated. Running a correction on all those denormalized rows is going
to be problematic.

Alternatively, I suppose you could store the User and Actions tables
separately, keep the audits structure in HBase storing only IDs, and
use the website's application layer to "merge" the different data sets
together for display on a page. The downside there is that if you wind up
with a significant number of users or actions, it'll put a lot of memory
pressure on the app servers.

Second case:

Doing analysis on two time-series-based data structures, such as a "PE ratio".

In the RDBMS world you'd have two tables:

Prices (ticker, date, price)
Earnings (ticker, date, earning)

Again, I think the answer is denormalizing in the HBase world, with a
structure such as:

PEs (date, timestamp, PERatio[family (ticker, PEvalue)])

The problem here comes, again, with updates. For instance, what if you only
have earnings information available on an annual basis, and you come across
a source that has it quarterly... you'll have to update 3/4 of the rows in
the denormalized table.

Once again, I apologize for any sort of misunderstanding... I'm still learning
the concepts behind column stores and map/reduce.

-hob


On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]> wrote:

The answer is, it depends.

What are the details of what you are trying to join?  Is it just a simple
1-to-1 join, or 1-to-many or what?  At a minimum, the join would require two
round-trips.  However, 0.20 can do simple queries in the 1-10ms time-range
(closer to 1ms when the blocks are already cached).

The comparison to an RDBMS cannot be made directly because a single-node
RDBMS with a smallish table will be quite fast at simple index-based joins.
I would guess that unloaded, single-machine performance of this join
operation would be much faster in an RDBMS.

But if your table has millions or billions of rows, it's a different
situation.  HBase performance will stay nearly constant as your table
increases, as long as you have the nodes to support your dataset and the
load.

What are your targets for time (sub 100ms? 10ms?), and what are the details
of what you're joining?

As far as code is concerned, there is not much to a simple join, so I'm not
sure how helpful it would be.  If you give some detail perhaps I can provide
some pseudo-code for you.

JG


bharath vissapragada wrote:

JG, thanks for your reply.

Actually I am trying to implement a realtime join of two tables on HBase.
I tried the idea of denormalizing the tables to avoid the joins, but when
we do that, updating the data is really difficult. I understand that the
features I am trying to implement are those of an RDBMS and HBase is used
for a different purpose. Even then I want (rather, I would like to try) to
store the data in HBase and implement joins so that I could test its
performance, and if it's effective (at least on a large number of nodes),
it may be of some help to me. I know some people have already tried this.
If anyone has already tried this, can you just tell me how the results
are... I mean, are they good when compared to an RDBMS join on a single
machine?

Thanks

On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:

Bharath,

You need to outline what your actual requirements are if you want more
help. Open-ended questions that just ask for code are usually not answered.

What exactly are you trying to join?  Does this join need to happen in
"realtime" or is this part of a batch process?

Could you denormalize your data to prevent needing the join at runtime?

If you provide details about exactly what your data/schema is like (or a
similar example if this is confidential), then many of us are more than
happy to help you figure out what approach may work best.

When working with HBase, figuring out how you want to pull your data out
is key to how you want to put the data in.

JG


bharath vissapragada wrote:

Amandeep, can you tell me what kinds of joins you have implemented, and
which works the best (based on observation)? Can you show us the source
code (if possible)?

Thanks in advance

On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
wrote:

I've been doing joins by writing my own MR jobs. That works best.

Not tried cascading yet.

-ak

On 7/14/09, bharath vissapragada <[email protected]>
wrote:

That's fine... I know that HBase has completely different usage compared
to SQL. But for my application there is some kind of dependency involved
among the tables, so I need to implement a join. I wanted to know whether
there is some kind of implementation already...

Thanks
On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]> wrote:

HBase != SQL.
You might want MapReduce or Cascading.

On Tue, Jul 14, 2009 at 9:56 PM, bharath vissapragada <[email protected]> wrote:

Hi all,

I want to join (similar to a relational database join) two tables in HBase.

Can anyone tell me whether it is already implemented in the source?

Thanks in advance


--
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz



