I can think of two cases that I've been wondering about (I am very new and
am still reading the docs & archives, so I apologize if this has already
been covered or if I use the wrong notation... I'm still learning).
First case:
Tracking audits. In the RDBMS world you'd have the following schema:
User (userid, firstname, lastname)
Actions (actionid, actionname)
Audit (auditTime, userid, actionid)
I think the answer in the HBase world is to denormalize the data and have a
structure such as:
audits (auditid, audittime[timestamp], whowhat[family (firstname, lastname, actionname)])
The problem, as Bharath says, is what happens when a firstname or lastname
needs to be updated. Running a correction on all those denormalized rows is
going to be problematic.
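To make the update cost concrete, here is a minimal Python sketch. It uses a plain dict as a stand-in for the denormalized audits table and does not call any HBase client. The sample rows, the `rename_user` helper, and the extra `userid` column (kept so the affected rows can be found at all) are assumptions for illustration only.

```python
# Plain-Python stand-in for the denormalized "audits" table: row key -> columns.
# Assumption: each row also keeps the userid so that copies can be located.
audits = {
    "audit-001": {"userid": "u1", "firstname": "Ann", "lastname": "Lee", "actionname": "login"},
    "audit-002": {"userid": "u2", "firstname": "Bob", "lastname": "Ray", "actionname": "login"},
    "audit-003": {"userid": "u1", "firstname": "Ann", "lastname": "Lee", "actionname": "logout"},
}

def rename_user(table, userid, new_lastname):
    """Rewrite every audit row that carries a copy of the user's name.

    Returns the number of rows touched. In this sketch that means scanning
    the whole table, because the name is copied into each audit row.
    """
    touched = 0
    for row in table.values():
        if row["userid"] == userid:
            row["lastname"] = new_lastname
            touched += 1
    return touched
```

A single name change touches one row per audit that user ever generated, which is why the correction grows with the audit history rather than with the number of users.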
Alternatively, I suppose you could store the User and Actions tables
separately, keep the audits structure in HBase storing only IDs, and use
the website's application layer to "merge" the different data sets together
for display on a page. The downside there is that if you wind up with a
significant number of users or actions, it'll put a lot of memory pressure
on the app servers.
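A rough sketch of that application-layer merge, in plain Python with dicts and tuples standing in for the three tables. Nothing here is an HBase API: `fetch_by_ids` and `merge_page` are hypothetical helpers, and the sample data is made up. The sketch assumes the page only needs the users and actions that actually appear in the audits being displayed.

```python
# Stand-ins for the separately stored tables (sample data is illustrative).
users = {
    "u1": {"firstname": "Ann", "lastname": "Lee"},
    "u2": {"firstname": "Bob", "lastname": "Ray"},
}
actions = {"a1": "login", "a2": "logout"}
# Audit rows hold only IDs: (audittime, userid, actionid).
audits = [
    ("2009-07-16T11:19:00", "u1", "a1"),
    ("2009-07-16T11:25:00", "u2", "a1"),
    ("2009-07-16T11:40:00", "u1", "a2"),
]

def fetch_by_ids(table, ids):
    """Hypothetical lookup helper: return only the requested rows."""
    return {i: table[i] for i in ids if i in table}

def merge_page(audit_page, user_table, action_table):
    """Resolve the IDs on one page of audits into display rows."""
    # Collect the distinct IDs on this page, then look up only those.
    user_ids = {userid for _, userid, _ in audit_page}
    action_ids = {actionid for _, _, actionid in audit_page}
    page_users = fetch_by_ids(user_table, user_ids)
    page_actions = fetch_by_ids(action_table, action_ids)

    merged = []
    for audit_time, userid, actionid in audit_page:
        user = page_users.get(userid, {})
        merged.append({
            "audittime": audit_time,
            "firstname": user.get("firstname"),
            "lastname": user.get("lastname"),
            "actionname": page_actions.get(actionid),
        })
    return merged
```

Looking up only the IDs on the current page, instead of holding the whole User and Actions tables in the app server, is one possible way to limit the memory pressure, and a name change becomes a single-row update because the audits never copy the name.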
Second case:
Doing analysis on two time-series data structures, such as computing a "PE
Ratio". In the RDBMS world you'd have two tables:
Prices (ticker, date, price)
Earnings (ticker, date, earning)
Again, I think the answer is denormalizing in the HBase world, with a
structure such as:
PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
The problem here comes, again, with updates. For instance, what if you only
have earnings information available on an annual basis, and you've come
across a source that has it quarterly? You'll have to update 3/4 of the
rows in the denormalized table.
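One alternative to storing the ratio is to keep prices and earnings separate and compute the PE value at read time from the most recent earnings figure on or before the requested date. The Python sketch below illustrates that idea with in-memory lists. It is not HBase code, the ticker and numbers are invented, `pe_ratio` is a hypothetical helper, and dividing the price by a single earnings figure is a simplification of how a PE ratio is normally defined.

```python
from bisect import bisect_right

# Per-ticker series of (ISO date, value), kept sorted by date.
# ISO date strings compare correctly as plain strings.
prices = {"ACME": [("2009-03-31", 20.0), ("2009-06-30", 24.0)]}
earnings = {"ACME": [("2008-12-31", 2.0)]}  # annual figure only, to start

def pe_ratio(ticker, date, prices, earnings):
    """Price on `date` divided by the latest earnings on or before `date`.

    Returns None when no earnings figure exists yet for that date.
    """
    price = dict(prices[ticker])[date]
    series = earnings[ticker]
    dates = [d for d, _ in series]
    i = bisect_right(dates, date)  # number of earnings entries dated <= date
    if i == 0:
        return None
    return price / series[i - 1][1]
```

With this layout, loading a quarterly earnings source means appending rows to the earnings series; no stored PE rows need to be rewritten, at the cost of doing the lookup on every read.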
Once again, I apologize for any sort of misunderstanding. I'm still learning
the concepts behind column stores and map/reduce.
-hob
On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]> wrote:
The answer is, it depends.
What are the details of what you are trying to join? Is it just a simple
1-to-1 join, or 1-to-many, or what? At a minimum, the join would require two
round-trips. However, 0.20 can do simple queries in the 1-10ms range
(closer to 1ms when the blocks are already cached).
The comparison to an RDBMS cannot be made directly because a single-node
RDBMS with a smallish table will be quite fast at simple index-based joins.
I would guess that the unloaded, single-machine performance of this join
operation would be much faster in an RDBMS.
But if your table has millions or billions of rows, it's a different
situation. HBase performance will stay nearly constant as your table
increases, as long as you have the nodes to support your dataset and the
load.
What are your targets for time (sub 100ms? 10ms?), and what are the details
of what you're joining?
As far as code is concerned, there is not much to a simple join, so I'm not
sure how helpful it would be. If you give some detail perhaps I can provide
some pseudo-code for you.
JG
bharath vissapragada wrote:
JG, thanks for your reply.
I am trying to implement a realtime join of two tables in HBase. I tried
denormalizing the tables to avoid the joins, but then updating the data
becomes really difficult. I understand that the features I am trying to
implement are those of an RDBMS and that HBase is used for a different
purpose. Even so, I would like to store the data in HBase and implement
joins so that I can test the performance; if it is effective (at least on a
large number of nodes), it may be of some help to me. I know some people
have already tried this. If any of you have, can you tell me how the results
were? I mean, are they good compared to an RDBMS join on a single machine?
Thanks
On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
Bharath,
You need to outline what your actual requirements are if you want more
help. Open-ended questions that just ask for code are usually not answered.
What exactly are you trying to join? Does this join need to happen in
"realtime" or is this part of a batch process?
Could you denormalize your data to prevent needing the join at runtime?
If you provide details about exactly what your data/schema is like (or a
similar example if this is confidential), then many of us are more than
happy to help you figure out what approach might work best.
When working with HBase, figuring out how you want to pull your data out is
key to how you want to put the data in.
JG
bharath vissapragada wrote:
Amandeep, can you tell me what kinds of joins you have implemented, and
which works best (based on observation)? Can you show us the source code,
if possible?
Thanks in advance
On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]> wrote:
I've been doing joins by writing my own MR jobs. That works best.
I have not tried Cascading yet.
-ak
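For readers unfamiliar with the approach, the following is a generic reduce-side join simulated in plain Python. It is not Amandeep's code and uses no Hadoop or HBase APIs; the function names, the "L"/"R" tags, and the sample rows in the usage note are assumptions chosen for illustration. The map step tags each row with its source table and emits it under the join key; the reduce step pairs rows from the two sources that share a key.

```python
from collections import defaultdict
from itertools import product

def map_phase(tag, rows, key_field):
    """Emit (join key, (source tag, row)) for every input row."""
    for row in rows:
        yield row[key_field], (tag, row)

def reduce_phase(tagged_rows):
    """Pair every left row with every right row that shares the key."""
    lefts = [row for tag, row in tagged_rows if tag == "L"]
    rights = [row for tag, row in tagged_rows if tag == "R"]
    for left, right in product(lefts, rights):
        yield {**left, **right}

def reduce_side_join(left_rows, right_rows, key_field):
    """Inner join of two row lists, structured like a map/reduce job."""
    # The dict plays the role of the shuffle: it groups values by key.
    groups = defaultdict(list)
    for key, value in map_phase("L", left_rows, key_field):
        groups[key].append(value)
    for key, value in map_phase("R", right_rows, key_field):
        groups[key].append(value)

    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(groups[key]))
    return joined
```

For example, joining `[{"userid": "u1", "firstname": "Ann"}]` with `[{"userid": "u1", "actionid": "a1"}]` on `"userid"` yields one merged row per matching pair; keys present in only one input produce no output, as in an inner join.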
On 7/14/09, bharath vissapragada <[email protected]> wrote:
That's fine. I know that HBase has a completely different usage compared to
SQL, but in my application there is some dependency among the tables, so I
need to implement a join. I wanted to know whether some kind of
implementation already exists.
Thanks
On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]> wrote:
HBase != SQL.
You might want MapReduce or Cascading.
On Tue, Jul 14, 2009 at 9:56 PM, bharath vissapragada <[email protected]> wrote:
Hi all,
I want to join two tables in HBase (similar to a relational database join).
Can anyone tell me whether this is already implemented in the source?
Thanks in Advance
--
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz