Re: Join in HBase

Jonathan Gray Fri, 17 Jul 2009 08:58:34 -0700

I didn't see much of an algorithm beyond simple usage of MR.

You read in everything, output the column value as the key to the reduce. They are then combined and you have the join in the reduce.

That's the general way you would do something simple like that in MapReduce. But this is not MR, there is no shuffle/sort/reduce.
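As a rough illustration of the reduce-side join described above, here is a minimal in-memory sketch in Python. It does not use Hadoop or HBase at all; the sample rows and the helper names (map_phase, shuffle, reduce_phase) are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; table and column names are illustrative only.
users = [{"userid": "u1", "firstname": "Ann"}, {"userid": "u2", "firstname": "Bob"}]
audits = [
    {"auditid": "a1", "userid": "u1"},
    {"auditid": "a2", "userid": "u2"},
    {"auditid": "a3", "userid": "u1"},
]

def map_phase(table_name, rows, join_column):
    """Emit (join key, (source table, row)) for every input row."""
    for row in rows:
        yield row[join_column], (table_name, row)

def shuffle(pairs):
    """Group mapped values by key, standing in for the MR shuffle/sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every user row with every audit row."""
    joined = []
    for key in sorted(groups):
        left = [row for name, row in groups[key] if name == "users"]
        right = [row for name, row in groups[key] if name == "audits"]
        for user in left:
            for audit in right:
                joined.append({**audit, **user})
    return joined

pairs = list(map_phase("users", users, "userid"))
pairs += list(map_phase("audits", audits, "userid"))
joined = reduce_phase(shuffle(pairs))
```

Tagging each value with its source table is what lets the reduce step tell the two sides of the join apart once they arrive under the same key.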

Not sure how much more we can help without knowing the specifics of what you want to do. Hoberto provided some nice examples and breakdown of how you might solve them, that should help.


JG

bharath vissapragada wrote:

Hi ,

Did you see the algorithm for the map-reduce join I have implemented? (I
wrote it at the end of my previous mail.) Any comments about it, such as
possible improvements?

On Thu, Jul 16, 2009 at 11:12 PM, Jonathan Gray <[email protected]> wrote:

Hoberto, Bharath,

Designing these kinds of queries efficiently in HBase means doing multiple
round-trips, or denormalizing.

That means degrading performance as the query complexity increases, or lots
of data duplication and a complex write/update process.

In your audit example, you provide the denormalizing solution.  Store the
fields you need with the data you are querying (details of the user/action
in the audit table with the audit).  If you have to update those details,
then you have an extra expense on your write (and you introduce a potential
synchronization issue without transactions).
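To make the write-side cost concrete, here is a small in-memory sketch in Python. The dicts stand in for the user and audit tables, and write_audit and rename_user are hypothetical helpers, not HBase calls. In a real deployment, finding the affected audit rows would itself need a scan or an index, which this sketch glosses over.

```python
# Hypothetical in-memory stand-ins for the two tables; not HBase API calls.
users = {"u1": {"firstname": "Ann", "lastname": "Lee"}}
audits = {}

def write_audit(audit_id, user_id, action_name):
    """Copy the user details into the audit row at write time."""
    user = users[user_id]
    audits[audit_id] = {
        "userid": user_id,
        "firstname": user["firstname"],
        "lastname": user["lastname"],
        "actionname": action_name,
    }

def rename_user(user_id, new_lastname):
    """Update the user, then rewrite every audit row that copied the old name.

    Returns the number of extra audit writes. Without transactions, a reader
    can see the user row and the audit rows disagree between these two steps.
    """
    users[user_id]["lastname"] = new_lastname
    extra_writes = 0
    for row in audits.values():
        if row["userid"] == user_id:
            row["lastname"] = new_lastname
            extra_writes += 1
    return extra_writes

write_audit("a1", "u1", "login")
write_audit("a2", "u1", "logout")
extra = rename_user("u1", "Chen")
```

Here one logical update turns into one user write plus one write per audit row that copied the name.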

The choice about how to solve this really depends on the use case and what
your requirements are.  Can you ever afford to miss an update in one of the
denormalized fields, even if it is extremely unlikely?  You can build
transactional layers on top or you can take a look at TransactionalHBase
which attempts to do this in a more integrated way.

You also talk about the other approach, running multiple queries in the
application.  As far as memory pressure in the app is concerned, that would
really depend on the nature of the join.  It's more an issue of how many
joins you need to make, and if there's any way to reduce the number of
queries/trips needed.

If I am pulling the most recent 10 audits, and I need to join each with
both the User and Actions table, then we're talking about 1 + 10 + 10 total
queries.  That's not so pretty, but if done in a distributed or threaded way
may not be too bad.  In the future, I expect more and more tools/frameworks
available to aid in that process.
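A minimal sketch of where the 1 + 10 + 10 count comes from, in Python. The tables are plain in-memory structures and scan_recent and get are hypothetical stand-ins, each counted as one round-trip; none of this is the HBase client API.

```python
# Hypothetical in-memory tables; each call to scan_recent() or get() stands in
# for one round-trip to the store.
users = {f"u{i}": {"firstname": f"First{i}"} for i in range(10)}
actions = {f"x{i}": {"actionname": f"action{i}"} for i in range(10)}
# Assumed to be stored newest first, so a prefix is "the most recent".
audits = [{"auditid": f"a{i}", "userid": f"u{i}", "actionid": f"x{i}"} for i in range(10)]

counter = {"round_trips": 0}

def scan_recent(table, limit):
    """One round-trip that returns the first `limit` audit rows."""
    counter["round_trips"] += 1
    return table[:limit]

def get(table, key):
    """One round-trip that fetches a single row by key."""
    counter["round_trips"] += 1
    return table[key]

def recent_audits_joined(limit=10):
    rows = []
    for audit in scan_recent(audits, limit):       # 1 query
        user = get(users, audit["userid"])          # +1 per audit
        action = get(actions, audit["actionid"])    # +1 per audit
        rows.append({**audit, **user, **action})
    return rows

result = recent_audits_joined()
```

After the call, counter["round_trips"] holds 21: one scan plus two lookups for each of the ten audits.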

Today, this is up to you.

At Streamy, we solve these problems with layers above HBase.  Some of them
keep lots of stuff in-memory and do complex joins in-memory. Others
coordinate the multiple queries to HBase, with or without an OCC-style
transaction.

My suggestion is to start denormalized.  Build naive queries that do lots
of round-trips.  See what the performance is like under different conditions
and then go from there.  Perhaps Actions are generally immutable, their name
never changes, so you could denormalize that field and cut out half of the
total queries.  Have a pool of threads that grab Users so you can do the
join in parallel.  Depending on your requirements, this might be sufficient.
 Otherwise look at more denormalization, or building a thin layer above.
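A rough Python sketch of that suggestion, under stated assumptions: the action name is copied into each audit row, so only the User lookups remain, and those are issued from a thread pool. The tables are in-memory stand-ins and scan_recent and get_user are hypothetical helpers, not HBase calls.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

users = {f"u{i}": {"firstname": f"First{i}"} for i in range(10)}
# actionname is copied into each audit row, assuming actions never change.
audits = [{"auditid": f"a{i}", "userid": f"u{i}", "actionname": f"action{i}"} for i in range(10)]

lock = threading.Lock()
counter = {"round_trips": 0}

def count_trip():
    """Count one simulated round-trip; the lock keeps the count thread-safe."""
    with lock:
        counter["round_trips"] += 1

def scan_recent(limit):
    count_trip()
    return audits[:limit]

def get_user(user_id):
    count_trip()
    return users[user_id]

def recent_audits_joined(limit=10, workers=4):
    recent = scan_recent(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `recent`.
        fetched = list(pool.map(get_user, [a["userid"] for a in recent]))
    return [{**audit, **user} for audit, user in zip(recent, fetched)]

result = recent_audits_joined()
```

This brings the count down to 1 + 10 round-trips, and the ten User lookups overlap in time instead of running back to back.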

JG


Mr Hoberto wrote:

I can think of two cases that I've been wondering about (I am very new, and
am still reading the docs & archives, so I apologize if this has been
already covered or if I use the wrong notation...I'm still learning).

First case:

Tracking audits. In the RDBMS world you'd have the following schema:

User (userid, firstname, lastname)
Actions (actionid, actionname)
Audit (auditTime, userid, actionid)

I think the answer in the HBase world is to denormalize the data...have a
structure such as:

audits (auditid, audittime[timestamp], whowhat[family (firstName, lastname,
actionname)])

The problem, as Bharath says, is what happens if a firstName or lastName
needs to be updated. Running a correction on all those denormalized rows is
going to be problematic.

Alternatively, I suppose you could store the User and Actions tables
separately, keep the audits structure in HBase storing only IDs, and use the
website's application layer to "merge" the different data sets together for
display on a page. The downside there is that if you wind up with a
significant number of users or actions, it'll put a lot of memory pressure
on the app servers.

Second case:

Doing analysis on two time-series based data structures, such as a "PE Ratio".

In the RDBMS world you'd have two tables:

Prices (ticker, date, price)
Earnings (ticker, date, earning)

Again, I think the answer is denormalizing in the HBase world, with a
structure such as:

PEs (date, timestamp, PERatio[family (ticker, PEvalue)])

The problem here comes, again, with updates. For instance, what if you only
have available earnings information on an annual basis, and you've come
across a source that has it quarterly... You'll have to update 3/4 of the
rows in the denormalized table.

Once again, I apologize for any sort of misunderstanding. I'm still learning
the concepts behind column stores and map/reduce.

-hob


On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]>
wrote:

 The answer is, it depends.

What are the details of what you are trying to join?  Is it just a simple
1-to-1 join, or 1-to-many or what?  At a minimum, the join would require two
round-trips.  However, 0.20 can do simple queries in the 1-10ms time-range
(closer to 1ms when the blocks are already cached).

The comparison to an RDBMS cannot be made directly because a single-node
RDBMS with a smallish table will be quite fast at simple index-based joins.
I would guess that unloaded, single machine performance of this join
operation would be much faster in an RDBMS.

But if your table has millions or billions of rows, it's a different
situation.  HBase performance will stay nearly constant as your table
increases, as long as you have the nodes to support your dataset and the
load.

What are your targets for time (sub 100ms? 10ms?), and what are the details
of what you're joining?

As far as code is concerned, there is not much to a simple join, so I'm not
sure how helpful it would be.  If you give some detail perhaps I can provide
some pseudo-code for you.

JG


bharath vissapragada wrote:

JG, thanks for your reply.

Actually I am trying to implement a realtime join of two tables on HBase. I
tried the idea of denormalizing the tables to avoid the joins, but when we
do that, updating the data is really difficult. I understand that the
features I am trying to implement are those of an RDBMS and that HBase is
used for a different purpose. Even then, I would like to try storing the
data in HBase and implementing joins so that I can test the performance; if
it is effective (at least on a large number of nodes), it may be of some
help to me. I know some people have already tried this. If any of you have,
can you tell me how the results were? I mean, are they good when compared to
an RDBMS join on a single machine?

Thanks

On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]>
wrote:

 Bharath,

You need to outline what your actual requirements are if you want more
help.  Open-ended questions that just ask for code are usually not
answered.

What exactly are you trying to join?  Does this join need to happen in
"realtime" or is this part of a batch process?

Could you denormalize your data to prevent needing the join at runtime?

If you provide details about exactly what your data/schema is like (or a
similar example if this is confidential), then many of us are more than
happy to help you figure out what approach may work best.

When working with HBase, figuring out how you want to pull your data out is
key to how you want to put the data in.

JG


bharath vissapragada wrote:

Amandeep, can you tell me what kinds of joins you have implemented, and
which works best (based on observation)? Can you show us the source code
(if possible)?

Thanks in advance

On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
wrote:

 I've been doing joins by writing my own MR jobs. That works best.

 Not tried cascading yet.

-ak

On 7/14/09, bharath vissapragada <[email protected]>
wrote:

That's fine. I know that HBase has completely different usage compared to
SQL. But for my application there is some kind of dependency involved among
the tables, so I need to implement a join. I wanted to know whether there is
some kind of implementation already.

Thanks
On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>

 wrote:

 HBase != SQL.

You might want map reduce or cascading.

On Tue, Jul 14, 2009 at 9:56 PM, bharath
vissapragada<[email protected]> wrote:

Hi all,

I want to join (similar to a relational database join) two tables in HBase.
Can anyone tell me whether it is already implemented in the source?

Thanks in advance


 --

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
