What you're asking for is a join.  You said you understand there is no built-in 
mechanism for it, but then asked whether there is functionality for combining 
the data.  Those are the same thing.

One thing to understand is that you're talking about a very traditional 
relational data model.  That fits very well into an RDBMS and less so into an 
HBase model.  However it is still possible to implement it in the same way as 
an RDBMS (by doing your own joining) or in a different way by denormalizing the 
data.

To denormalize the data you would combine these things into a single table (or 
fewer than three), or in each table duplicate the data for the others.

For example, let's say a customer can have any number of claims (1-to-many).  
Rather than thinking of it like a relational database, where each of these 
things is in a different table and they reference one another, you might just 
put them into a single table.

The customer table (keyed on customerid) would have a 'claims' family.  For 
each claim, you could insert a column whose qualifier is the claimid (or a 
composite qualifier, prepended with a timestamp for example, if you needed 
time sorting).  The value would be the claim information in a serialized type.

If you did not want to use a serialized type, you could still spread each 
claim over multiple columns by adding additional type information (such as the 
field name) to the column qualifier.  For example:  
<timestamp><claimid><fieldname> as the qualifier and <fieldvalue> as the 
value.  You would have to use filters to get everything for a claimid, which 
is unfortunate (it would be possible to implement start/stop keyvalues, but 
that is currently not supported).

In that case, you might make the table tall instead of wide and push these 
things into the row key:  <customerid><policyid><timestamp><claimid>, with 
column qualifiers -> values for each field.  This would allow you to do a Get 
for a single claim (you'd have to know the full row key to do a Get), and 
would also allow you to do queries like "give me all policies and claims for 
this customer", "give me the 10 most recent claims for this customer's 
policy", etc.
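To make the tall layout a bit more concrete, here is a rough sketch in plain 
Python.  It is not HBase client code; a sorted list stands in for the table, 
and row_key / prefix_scan are names I made up for illustration.  One 
assumption that goes beyond what I wrote above: the timestamp is stored 
inverted (MAX_TS - timestamp) so that the newest claim sorts first, which is 
what makes a "most recent N" query a simple prefix scan with a limit.

```python
import struct
from bisect import bisect_left

MAX_TS = 2**63 - 1  # assumption: invert timestamps so newer claims sort first


def row_key(customer_id, policy_id, ts_millis, claim_id):
    """Build <customerid><policyid><inverted timestamp><claimid> as bytes."""
    # Fixed-width big-endian fields keep byte order equal to numeric order.
    return struct.pack(">QQQQ", customer_id, policy_id, MAX_TS - ts_millis, claim_id)


def prefix_scan(sorted_rows, prefix, limit=None):
    """Return (key, columns) pairs whose key starts with prefix, in key order."""
    keys = [k for k, _ in sorted_rows]
    start = bisect_left(keys, prefix)
    out = []
    for key, columns in sorted_rows[start:]:
        if not key.startswith(prefix) or (limit is not None and len(out) >= limit):
            break
        out.append((key, columns))
    return out


# Toy "table": (row key, {qualifier: value}) pairs kept sorted by row key.
# All ids, timestamps and amounts are made-up sample data.
table = sorted(
    (
        (row_key(c, p, ts, claim), {"amount": amount})
        for c, p, ts, claim, amount in [
            (1, 10, 1000, 501, "250.00"),
            (1, 10, 3000, 503, "75.50"),
            (1, 11, 2000, 502, "1200.00"),
            (2, 20, 1500, 504, "40.00"),
        ]
    ),
    key=lambda row: row[0],
)

# "All policies and claims for customer 1": the prefix is just the customer id.
customer_1 = prefix_scan(table, struct.pack(">Q", 1))

# "Most recent claim for customer 1's policy 10": longer prefix plus a limit.
latest = prefix_scan(table, struct.pack(">QQ", 1, 10), limit=1)
```

The point is only that the order of the fields in the row key decides which 
questions turn into a cheap contiguous scan.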

For your specific example, where you don't want to pivot on the customer first 
but rather on the time of the claim, you might create a table with rows such as 
<claim_timestamp><claim_id>.  Then you could use scanners to grab any claims 
within any range of time (rows from now() - 1 month to now()).
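As a rough illustration of that time-keyed table, again in plain Python 
rather than the HBase client API:  a sorted list stands in for the table and 
scan() mimics a scanner with a start row and a stop row.  The claim_row_key 
helper, the key widths, and the idea of keeping the customerid as a column in 
each claim row are assumptions of mine for the sketch, and the dates and ids 
are made-up sample data.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def claim_row_key(filed_at, claim_id):
    """<claim_timestamp><claim_id> as a fixed-width, sortable string."""
    millis = int(filed_at.timestamp() * 1000)
    # Zero padding makes string order match numeric order.
    return f"{millis:013d}{claim_id:010d}"


def scan(sorted_rows, start_key, stop_key):
    """Rows with start_key <= key < stop_key, like a start/stop row scan."""
    keys = [k for k, _ in sorted_rows]
    return sorted_rows[bisect_left(keys, start_key):bisect_left(keys, stop_key)]


utc = timezone.utc
claims = sorted(
    [
        (claim_row_key(datetime(2010, 1, 5, tzinfo=utc), 501), {"customerid": "c1"}),
        (claim_row_key(datetime(2010, 2, 25, tzinfo=utc), 502), {"customerid": "c2"}),
        (claim_row_key(datetime(2010, 3, 10, tzinfo=utc), 503), {"customerid": "c1"}),
    ],
    key=lambda row: row[0],
)

# Scan "the past month" relative to a fixed point in time.
now = datetime(2010, 3, 19, tzinfo=utc)
month_ago = now - timedelta(days=30)
recent = scan(claims, claim_row_key(month_ago, 0), claim_row_key(now, 0))

# Customers that filed a claim in that window, without touching another table.
recent_customers = sorted({columns["customerid"] for _, columns in recent})
```

If the claim rows did not carry the customerid, you would follow the scan 
with a lookup per claim against the other table, which is the join.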

Whether you denormalize the claims and store their full content in the table is 
another question.  It depends on how much data there is and how many times you 
would need to duplicate it (you may need to create a new table for every query 
you want to support if they each pivot on a different column:  time, claim, 
customer, policy, etc.).  So the trade-off is:  if denormalizing, you get 
significantly faster reads at the expense of slower writes and data 
duplication.  If joining, you get better space efficiency and faster writes at 
the expense of slower reads.
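A toy sketch of the two sides of that trade-off, with plain Python dicts 
standing in for tables.  Nothing here is HBase API; the function names, 
column names, and sample rows are all made up for illustration.  The join 
version pays one extra lookup per claim at read time; the denormalized 
version pays at write time by copying the customer name into every claim row.

```python
# Toy tables: row key -> {qualifier: value}.  Sample data is invented.
customers = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}
claims = {
    "claim1": {"customerid": "c1", "amount": "250.00"},
    "claim2": {"customerid": "c2", "amount": "40.00"},
    "claim3": {"customerid": "c1", "amount": "75.50"},
}


def read_with_join(claims, customers):
    """Join at read time: one extra lookup (think: one Get) per claim."""
    lookups = 0
    result = []
    for claim_id, claim in sorted(claims.items()):
        customer = customers[claim["customerid"]]
        lookups += 1
        result.append((claim_id, customer["name"], claim["amount"]))
    return result, lookups


def write_denormalized(claims, customers):
    """Denormalize at write time: copy the customer name into each claim row."""
    table = {}
    for claim_id, claim in claims.items():
        name = customers[claim["customerid"]]["name"]
        table[claim_id] = dict(claim, customer_name=name)
    return table


def read_denormalized(table):
    """Read with no extra lookups: everything is already in the claim row."""
    return [
        (claim_id, row["customer_name"], row["amount"])
        for claim_id, row in sorted(table.items())
    ]


joined, extra_lookups = read_with_join(claims, customers)
wide_claims = write_denormalized(claims, customers)
same_answer = read_denormalized(wide_claims) == joined
```

Both reads give the same answer; what differs is where the work and the 
extra copies of the data end up.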

One of the advantages of HBase over an RDBMS is that you get to choose these 
trade-offs.  Often times in an RDBMS (especially in "by the book" schema 
design) there is one way and you don't have this flexibility.

Hope that helps more than it confuses :)

JG 

> -----Original Message-----
> From: Basmajian, Raffi [mailto:[email protected]]
> Sent: Friday, March 19, 2010 9:20 AM
> To: [email protected]
> Subject: RE: How to join tables in HBase 20.3
> 
> JG,
> 
> I understand that there is no built in mechanism to do joins, but the
> essence of combining data to make it more useful remains the same
> regardless of whether it's a rdmbs, hbase, etc, so there must be
> something in hbase that provided this functionality.
> 
> Assume for the moment that in hbase I have the tables Customer, Policy,
> and Claim for an auto insurance business. Say I want to get a list of
> all customers that filed a claim on their auto policy in the past
> month.
> If I use Get and/or Scan then that allows me to pull information from
> each individual table, but I still need to combine the data to give me
> the list of policies based on my original query. Is there additional
> functionality in hbase that enables combining the data? I've been
> searching in the samples and I can't find a clear and simple example.
> 
> Thanks
> Raffi
> 
> 
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Friday, March 19, 2010 12:03 PM
> To: [email protected]
> Subject: RE: How to join tables in HBase 20.3
> 
> At some point joins may be necessary when denormalization is not
> possible.
> 
> There is no built-in mechanism to do it.  It would be a series of
> additional Get calls to the second table you are joining against.  This
> would be helped significantly with a parallel MultiGet which will
> hopefully make it to 0.21.
> 
> JG
> 
> > -----Original Message-----
> > From: TuX RaceR [mailto:[email protected]]
> > Sent: Friday, March 19, 2010 8:41 AM
> > To: [email protected]
> > Subject: Re: How to join tables in HBase 20.3
> >
> > Hi Raffi,
> >
> > when dealing with key-value stores, you need to think in a different
> > way see for instance:
> >
> > http://wiki.apache.org/hadoop/Hbase/DataModel
> >
> > "Getting high scalability from your relational database isn't done by
> > simply adding more machines because its data model is based on a
> > single-machine architecture. For example, a JOIN between two tables is
> > done in memory and does not take into account the possibility that the
> > data has to go over the wire."
> >
> > JOIN simply does not scale in relational databases.
> >
> >
> > see also
> >
> > http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> >
> > *20 Are there any Schema Design examples?*
> >
> >
> > Hope this helps,
> >
> > Cheers
> > TuX
> >
> >
> > Basmajian, Raffi wrote:
> > > I am new to HBase and come from a rdbms background. After looking in
> > > the sample client code it seems fairly easy to query a single table
> > > using Get and Scan, but it's not so obvious how to join data across
> > > multiple tables.
> > >
> > > Are there any examples on how to read/join data across multiple
> > > tables?
> > >
> > > Thank you
> > >
> > > Raffi Basmajian
> > >
> > >
> > > -------------------------------------------------------------------
> -
> > > -
> > ---------
> > > This e-mail transmission may contain information that is
> > > proprietary,
> > privileged and/or confidential and is intended exclusively for the
> > person(s) to whom it is addressed. Any use, copying, retention or
> > disclosure by any person other than the intended recipient or the
> > intended recipient's designees is strictly prohibited. If you are not
> > the intended recipient or their designee, please notify the sender
> > immediately by return e-mail and delete all copies. OppenheimerFunds
> > may, at its sole discretion, monitor, review, retain and/or disclose
> > the content of all email communications.
> > >
> >
> ======================================================================
> > =
> > =======
> > >
> > >
> 
