Specifically, you may want to follow https://issues.apache.org/jira/browse/HIVE-1257, which is a ticket for debugging the current implementation of joins over HBase tables using Hive.
On Fri, Mar 19, 2010 at 9:46 AM, Jonathan Gray <[email protected]> wrote:
> What you're asking for is a join. You said you understand there isn't a mechanism to do it but then ask if there is functionality to provide combining the data. They are equivalent.
>
> One thing to understand is that you're talking about a very traditional relational data model. That fits very well into an RDBMS and less so into an HBase model. However it is still possible to implement it in the same way as an RDBMS (by doing your own joining) or in a different way by denormalizing the data.
>
> To denormalize the data you would combine these things into a single table (or fewer than three), or in each table duplicate the data for the others.
>
> For example, let's say a customer can have any number of claims (1-to-many). Rather than thinking of it like a relational database where each of these things are in a different table and reference one another, you might just toss them into a single table.
>
> The customer table (keyed on customerid) would have a 'claims' family. For each claim, you could insert a column with the claimid (or a composite column if you needed time sorting, prepended with a stamp for example). The value would be the claim information in a serialized type. If you wanted to not use a serialized type, you could still spread each claim over multiple columns by adding additional type information into the column qualifier. For example: <timestamp><claimid><fieldname> and in the value <fieldvalue>. You have to use filters to get everything for a claimid, which is unfortunate (would actually be possible to implement start/stop keyvalues but currently not supported). In that case, you might make the table tall instead of wide and push these things into the row key. <customerid><policyid><timestamp><claimid> and then you could have column qualifiers -> values for each field.
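A minimal sketch of the tall-table layout described above, written in plain Python rather than against the HBase client API. `SortedTable`, `row_key`, and `prefix_scan` are hypothetical helpers, and the field widths and sample values are made up for illustration. The only point is that fixed-width fields in the row key keep one customer's (and one policy's) claims adjacent in sort order, so a prefix range scan retrieves them without a join.

```python
import bisect


def row_key(customer_id, policy_id, timestamp_ms, claim_id):
    """Build <customerid><policyid><timestamp><claimid> as one string.

    Fields are zero-padded to fixed widths (an assumption of this sketch)
    so that lexicographic order matches numeric order.
    """
    return f"{customer_id:08d}{policy_id:08d}{timestamp_ms:013d}{claim_id:08d}"


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> {column qualifier: value}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        """Point lookup; requires the full row key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, columns) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


def prefix_scan(table, prefix):
    """Return all rows whose key starts with prefix.

    '~' sorts after every digit, so prefix + '~' is a valid stop key
    for the all-digit keys produced by row_key().
    """
    return list(table.scan(prefix, prefix + "~"))


# Hypothetical sample data: three claims for customer 7, one for customer 8.
claims = SortedTable()
claims.put(row_key(7, 1, 1268000000000, 100), {"amount": "1200"})
claims.put(row_key(7, 1, 1268900000000, 101), {"amount": "300"})
claims.put(row_key(7, 2, 1268500000000, 102), {"amount": "50"})
claims.put(row_key(8, 1, 1268700000000, 103), {"amount": "75"})

# "All policies and claims for this customer"
customer_7 = prefix_scan(claims, f"{7:08d}")
# "All claims for this customer's policy", oldest first
policy_7_1 = prefix_scan(claims, f"{7:08d}{1:08d}")
```

In this sketch the claims for a policy come back oldest first, so the most recent ones are at the end of the result.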
> This would allow you to do a Get for a single claim (you'd have to know the row key to do a get), but would also allow you to do queries like "give me all policies and claims for this customer", "give me the 10 most recent claims for this customer's policy", etc.
>
> For your specific example, where you don't want to pivot on the customer first but rather the time of the claim, you might create a table with rows such as <claim_timestamp><claim_id>. Then you could use scanners to grab any claims within any range of time (rows from now() - 1 month to now()).
>
> Whether you denormalize the claims and store their full content in the table is another question. The trade-off is really just about how much data there is and how many times you would need to duplicate it (you may need to create a new table for every query you want to support if they each pivot on a different column: time, claim, customer, policy, etc.). So the trade-off is: if denormalizing you get significantly faster reads at the expense of slower writes and data duplication. If joining, you get better space efficiency and faster writes at the expense of slower reads.
>
> One of the advantages of HBase over an RDBMS is that you get to choose these trade-offs. Oftentimes in an RDBMS (especially in "by the book" schema design) there is one way and you don't have this flexibility.
>
> Hope that helps more than it confuses :)
>
> JG
>
> > -----Original Message-----
> > From: Basmajian, Raffi [mailto:[email protected]]
> > Sent: Friday, March 19, 2010 9:20 AM
> > To: [email protected]
> > Subject: RE: How to join tables in HBase 20.3
> >
> > JG,
> >
> > I understand that there is no built in mechanism to do joins, but the essence of combining data to make it more useful remains the same regardless of whether it's an RDBMS, HBase, etc., so there must be something in HBase that provides this functionality.
> >
> > Assume for the moment that in hbase I have the tables Customer, Policy, and Claim for an auto insurance business. Say I want to get a list of all customers that filed a claim on their auto policy in the past month.
> > If I use Get and/or Scan then that allows me to pull information from each individual table, but I still need to combine the data to give me the list of policies based on my original query. Is there additional functionality in hbase that enables combining the data? I've been searching in the samples and I can't find a clear and simple example.
> >
> > Thanks
> > Raffi
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[email protected]]
> > Sent: Friday, March 19, 2010 12:03 PM
> > To: [email protected]
> > Subject: RE: How to join tables in HBase 20.3
> >
> > At some point joins may be necessary when denormalization is not possible.
> >
> > There is no built-in mechanism to do it. It would be a series of additional Get calls to the second table you are joining against. This would be helped significantly with a parallel MultiGet which will hopefully make it to 0.21.
> >
> > JG
> >
> > > -----Original Message-----
> > > From: TuX RaceR [mailto:[email protected]]
> > > Sent: Friday, March 19, 2010 8:41 AM
> > > To: [email protected]
> > > Subject: Re: How to join tables in HBase 20.3
> > >
> > > Hi Raffi,
> > >
> > > when dealing with key-value stores, you need to think in a different way see for instance:
> > >
> > > http://wiki.apache.org/hadoop/Hbase/DataModel
> > >
> > > "Getting high scalability from your relational database isn't done by simply adding more machines because its data model is based on a single-machine architecture. For example, a JOIN between two tables is done in memory and does not take into account the possibility that the data has to go over the wire."
> > >
> > > JOIN simply does not scale in relational databases.
> > >
> > > see also
> > >
> > > http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> > >
> > > *20 Are there any Schema Design examples?*
> > >
> > > Hope this helps,
> > >
> > > Cheers
> > > TuX
> > >
> > > Basmajian, Raffi wrote:
> > > > I am new to HBase and come from a rdbms background. After looking in the sample client code it seems fairly easy to query a single table using Get and Scan, but it's not so obvious how to join data across multiple tables.
> > > >
> > > > Are there any examples on how to read/join data across multiple tables?
> > > >
> > > > Thank you
> > > >
> > > > Raffi Basmajian
> > > >
> > > > ------------------------------------------------------------------------------
> > > > This e-mail transmission may contain information that is proprietary, privileged and/or confidential and is intended exclusively for the person(s) to whom it is addressed. Any use, copying, retention or disclosure by any person other than the intended recipient or the intended recipient's designees is strictly prohibited. If you are not the intended recipient or their designee, please notify the sender immediately by return e-mail and delete all copies. OppenheimerFunds may, at its sole discretion, monitor, review, retain and/or disclose the content of all email communications.
> > > > ==============================================================================
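
The client-side join discussed in this thread (a range scan over a table keyed on <claim_timestamp><claim_id>, followed by a series of Get calls against the customer table) can be sketched roughly as follows. This is plain Python with dictionaries standing in for tables, not the HBase client API; `time_key`, `scan`, `customers_with_claims`, and all sample rows are hypothetical and exist only to show the shape of the approach.

```python
import bisect


def time_key(timestamp_ms, claim_id):
    """Row key <claim_timestamp><claim_id>, zero-padded so string order
    matches chronological order (fixed widths are an assumption)."""
    return f"{timestamp_ms:013d}{claim_id:08d}"


# Toy stand-in for a claim table keyed on time (hypothetical rows).
claims_by_time = {
    time_key(1266000000000, 100): {"customer_id": "c1", "policy_id": "p1"},
    time_key(1268300000000, 101): {"customer_id": "c2", "policy_id": "p7"},
    time_key(1268900000000, 102): {"customer_id": "c1", "policy_id": "p1"},
}

# Toy stand-in for the customer table keyed on customer id.
customers = {
    "c1": {"name": "Alice"},
    "c2": {"name": "Bob"},
    "c3": {"name": "Carol"},
}


def scan(table, start, stop):
    """Return (key, row) pairs with start <= key < stop, in key order."""
    keys = sorted(table)
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(key, table[key]) for key in keys[lo:hi]]


def customers_with_claims(start_ms, stop_ms):
    """Join claims in [start_ms, stop_ms) against the customer table.

    Step 1: range scan over the time-keyed claim table.
    Step 2: one lookup (the "Get") per distinct customer id found.
    """
    found = {}
    for _, claim in scan(claims_by_time, time_key(start_ms, 0), time_key(stop_ms, 0)):
        customer_id = claim["customer_id"]
        if customer_id not in found:
            found[customer_id] = customers[customer_id]
    return found


now_ms = 1269000000000                    # fixed "now" for a repeatable example
month_ms = 30 * 24 * 60 * 60 * 1000       # 30 days, in milliseconds
recent = customers_with_claims(now_ms - month_ms, now_ms)
```

Deduplicating customer ids before the lookups keeps the number of Get calls proportional to the number of distinct customers rather than the number of claims, which matters because each lookup would be a separate round trip in a real deployment.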
