Specifically, you may want to follow
https://issues.apache.org/jira/browse/HIVE-1257, which is a ticket for
debugging the current implementation of joins over HBase tables using Hive.

On Fri, Mar 19, 2010 at 9:46 AM, Jonathan Gray <[email protected]> wrote:

> What you're asking for is a join.  You said you understand there isn't a
> mechanism to do it but then ask if there is functionality to provide
> combining the data.  They are equivalent.
>
> One thing to understand is that you're talking about a very traditional
> relational data model.  That fits very well into an RDBMS and less so into
> an HBase model.  However it is still possible to implement it in the same
> way as an RDBMS (by doing your own joining) or in a different way by
> denormalizing the data.
>
> To denormalize the data you would combine these things into a single table
> (or fewer than three), or in each table duplicate the data for the others.
>
> For example, let's say a customer can have any number of claims
> (1-to-many).  Rather than thinking of it like a relational database where
> each of these things is in a different table and they reference one
> another, you might just toss them into a single table.
>
> The customer table (keyed on customerid) would have a 'claims' family.  For
> each claim, you could insert a column with the claimid (or a composite
> column if you needed time sorting, prepended with a stamp for example).  The
> value would be the claim information in a serialized type.  If you wanted to
> not use a serialized type, you could still spread each claim over multiple
> columns by adding additional type information into the column qualifier.
> For example:  <timestamp><claimid><fieldname> as the qualifier and
> <fieldvalue> as the value.  You would have to use filters to get
> everything for a claimid, which is unfortunate (it would be possible to
> implement start/stop KeyValues, but that is currently not supported).  In
> that case, you might make the table tall instead of wide and push these
> things into the row key:  <customerid><policyid><timestamp><claimid>, with
> column qualifiers -> values for each field.  This would allow you to do a
> Get for a single claim (you'd have to know the full row key), and would
> also allow you to do queries like "give me all policies and claims for
> this customer", "give me the 10 most recent claims for this customer's
> policy", etc.
>
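To make the tall layout above concrete, here is a minimal sketch in plain Python. It is not the HBase client API: a sorted list of strings stands in for HBase's sorted row keys, and the separator, padding width, helper names and sample data are all assumptions made for illustration.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator; sorts before any printable character


def row_key(customer_id, policy_id, timestamp, claim_id):
    # Zero-pad the timestamp so string order matches numeric order.
    return SEP.join([customer_id, policy_id, f"{timestamp:013d}", claim_id])


def prefix_scan(sorted_keys, *parts):
    """Return every key whose leading components equal `parts`."""
    prefix = SEP.join(parts) + SEP
    start = bisect_left(sorted_keys, prefix)
    stop = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]


# Hypothetical rows; a real table would also hold column values per row.
keys = sorted([
    row_key("cust1", "polA", 1268000000000, "c1"),
    row_key("cust1", "polA", 1268900000000, "c3"),
    row_key("cust1", "polB", 1268500000000, "c2"),
    row_key("cust2", "polC", 1268700000000, "c4"),
])

# "Give me all policies and claims for this customer."
all_for_cust1 = prefix_scan(keys, "cust1")

# "Give me the 10 most recent claims for this customer's policy."
pol_a = prefix_scan(keys, "cust1", "polA")
recent = [k.split(SEP)[-1] for k in reversed(pol_a[-10:])]
```

Because the keys sort by customer, then policy, then time, both queries become contiguous ranges. Taking the tail of the range works here only because the whole range is in memory; with a forward-only scanner one common approach (an assumption beyond this thread) is to store an inverted timestamp so the newest rows sort first.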
> For your specific example, where you don't want to pivot on the customer
> first but rather on the time of the claim, you might create a table with
> row keys such as <claim_timestamp><claim_id>.  Then you could use scanners
> to grab any claims within any range of time (rows from now() - 1 month to
> now()).
>
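The time-keyed table can be sketched the same way. Again this is plain Python rather than HBase code; the fixed-width epoch-seconds encoding, the function names and the sample claims are assumptions for illustration only.

```python
from bisect import bisect_left

DAY = 86400  # seconds


def claim_row_key(epoch_seconds, claim_id):
    # Fixed-width timestamp first, so rows sort by time.
    return f"{epoch_seconds:010d}{claim_id}"


def scan_range(sorted_keys, start_ts, stop_ts):
    """Rows with start_ts <= timestamp < stop_ts, like a scan with a
    start row (inclusive) and a stop row (exclusive)."""
    lo = bisect_left(sorted_keys, f"{start_ts:010d}")
    hi = bisect_left(sorted_keys, f"{stop_ts:010d}")
    return sorted_keys[lo:hi]


now = 1269000000  # fixed "now" so the example is reproducible
rows = sorted([
    claim_row_key(now - 45 * DAY, "claim-a"),
    claim_row_key(now - 20 * DAY, "claim-b"),
    claim_row_key(now - 2 * DAY, "claim-c"),
])

# Claims filed in the last 30 days.
last_month = scan_range(rows, now - 30 * DAY, now + 1)
claim_ids = [key[10:] for key in last_month]
```

The scan touches only the rows inside the time window, which is the point of leading the row key with the timestamp.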
> Whether you denormalize the claims and store their full content in the
> table is another question.  It depends on how much data there is and how
> many times you would need to duplicate it (you may need a new table for
> every query you want to support if they each pivot on a different column:
> time, claim, customer, policy, etc.).  So the trade-off is:  if
> denormalizing, you get significantly faster reads at the expense of slower
> writes and data duplication.  If joining, you get better space efficiency
> and faster writes at the expense of slower reads.
>
> One of the advantages of HBase over an RDBMS is that you get to choose
> these trade-offs.  Oftentimes in an RDBMS (especially in "by the book"
> schema design) there is one way and you don't have this flexibility.
>
> Hope that helps more than it confuses :)
>
> JG
>
> > -----Original Message-----
> > From: Basmajian, Raffi [mailto:[email protected]]
> > Sent: Friday, March 19, 2010 9:20 AM
> > To: [email protected]
> > Subject: RE: How to join tables in HBase 20.3
> >
> > JG,
> >
> > I understand that there is no built-in mechanism to do joins, but the
> > essence of combining data to make it more useful remains the same
> > regardless of whether it's an RDBMS, HBase, etc., so there must be
> > something in HBase that provides this functionality.
> >
> > Assume for the moment that in HBase I have the tables Customer, Policy,
> > and Claim for an auto insurance business. Say I want to get a list of
> > all customers that filed a claim on their auto policy in the past
> > month. If I use Get and/or Scan then that allows me to pull information
> > from each individual table, but I still need to combine the data to give
> > me the list of policies based on my original query. Is there additional
> > functionality in HBase that enables combining the data? I've been
> > searching in the samples and I can't find a clear and simple example.
> >
> > Thanks
> > Raffi
> >
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[email protected]]
> > Sent: Friday, March 19, 2010 12:03 PM
> > To: [email protected]
> > Subject: RE: How to join tables in HBase 20.3
> >
> > At some point joins may be necessary when denormalization is not
> > possible.
> >
> > There is no built-in mechanism to do it.  It would be a series of
> > additional Get calls to the second table you are joining against.  This
> > would be helped significantly with a parallel MultiGet which will
> > hopefully make it to 0.21.
> >
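The "series of additional Get calls" described above amounts to a client-side join. The sketch below shows the shape of it in plain Python, with dictionaries standing in for the Customer, Policy and Claim tables; the column names, the `get` helper and the data are hypothetical, and nothing here is the HBase client API.

```python
# Hypothetical tables: row key -> columns.
claims = {
    "c1": {"policy_id": "p1", "filed": 20100301},
    "c2": {"policy_id": "p2", "filed": 20100110},
    "c3": {"policy_id": "p1", "filed": 20100315},
}
policies = {"p1": {"customer_id": "u1"}, "p2": {"customer_id": "u2"}}
customers = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}


def get(table, row_key):
    """Stand-in for a single-row Get; returns None if the row is missing."""
    return table.get(row_key)


def customers_with_claims_since(cutoff):
    names = set()
    for claim in claims.values():            # stand-in for a Scan over Claim
        if claim["filed"] < cutoff:
            continue
        policy = get(policies, claim["policy_id"])        # one Get per claim
        if policy is None:
            continue
        customer = get(customers, policy["customer_id"])  # and one more
        if customer is not None:
            names.add(customer["name"])
    return sorted(names)


result = customers_with_claims_since(20100219)
```

Each matching claim costs two extra lookups, which is why batching or parallelizing those lookups matters for this approach.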
> > JG
> >
> > > -----Original Message-----
> > > From: TuX RaceR [mailto:[email protected]]
> > > Sent: Friday, March 19, 2010 8:41 AM
> > > To: [email protected]
> > > Subject: Re: How to join tables in HBase 20.3
> > >
> > > Hi Raffi,
> > >
> > > when dealing with key-value stores, you need to think in a different
> > > way; see for instance:
> > >
> > > http://wiki.apache.org/hadoop/Hbase/DataModel
> > >
> > > "Getting high scalability from your relational database isn't done by
> > > simply adding more machines because its data model is based on a
> > > single-machine architecture. For example, a JOIN between two tables is
> > > done in memory and does not take into account the possibility that the
> > > data has to go over the wire."
> > >
> > > JOIN simply does not scale in relational databases.
> > >
> > >
> > > see also
> > >
> > > http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> > >
> > > *20 Are there any Schema Design examples?*
> > >
> > >
> > > Hope this helps,
> > >
> > > Cheers
> > > TuX
> > >
> > >
> > > Basmajian, Raffi wrote:
> > > > I am new to HBase and come from an RDBMS background. After looking in
> > > > the sample client code it seems fairly easy to query a single table
> > > > using Get and Scan, but it's not so obvious how to join data across
> > > > multiple tables.
> > > >
> > > > Are there any examples on how to read/join data across multiple
> > > > tables?
> > > >
> > > > Thank you
> > > >
> > > > Raffi Basmajian
> > > >
> > > >
> > > > -------------------------------------------------------------------
> > -
> > > > -
> > > ---------
> > > > This e-mail transmission may contain information that is
> > > > proprietary,
> > > privileged and/or confidential and is intended exclusively for the
> > > person(s) to whom it is addressed. Any use, copying, retention or
> > > disclosure by any person other than the intended recipient or the
> > > intended recipient's designees is strictly prohibited. If you are not
> > > the intended recipient or their designee, please notify the sender
> > > immediately by return e-mail and delete all copies. OppenheimerFunds
> > > may, at its sole discretion, monitor, review, retain and/or disclose
> > > the content of all email communications.
> > > >
> > >
> > ======================================================================
> > > =
> > > =======
> > > >
> > > >
> >
> >
> >
>
>
