Thanks, Jean-Daniel, for the reply. Greatly appreciate it.

So is this the recommended way of implementing a Parent-Child relationship in HBase? For example, a User visits zero to many WebPages, or a Customer buys one to many Items. In such cases, would we create a "Customer" HTable with a "buys" family and keep adding "ItemIds" for every "CustomerId"? It sounds a bit awkward for some reason, but if that's the recommended way then that's how I will implement it. Please let me know what the best way to implement Parent-Child relationships in HBase is.
Thanks.

________________________________
From: Jean-Daniel Cryans <[email protected]>
To: [email protected]
Sent: Tue, October 27, 2009 11:06:04 AM
Subject: Re: HBase table design question

I think your question was just forgotten.

So your value will not be overwritten, it will simply be on 2 different
timestamps and only the latest one will be retrieved if you do not specify
one on your Get. By default 3 versions of that cell will be kept but you can
change this with the family attributes.

J-D

On Tue, Oct 27, 2009 at 10:17 AM, Something Something <[email protected]> wrote:
> No responses to this question :( Is my question that stupid, I wonder!
>
> ________________________________
> From: Something Something <[email protected]>
> To: [email protected]
> Sent: Wed, October 21, 2009 12:16:19 PM
> Subject: Re: HBase table design question
>
> Thanks, Jonathan for the reply. One quick question...
>
> So in the User table when I perform the put operation:
>
> .put("visited", "pageId", 100);
>
> .put("visited", "pageId", 200);
>
> The 100 gets overwritten with 200. Correct? So should I use... something
> like this...
>
> .put("visited", "pageId100", 100);
> .put("visited", "pageId200", 200);
>
> I guess, I am still missing something... sorry.. Please explain. Thanks.
>
> ________________________________
> From: Jonathan Gray <[email protected]>
> To: [email protected]
> Sent: Wed, October 21, 2009 10:25:52 AM
> Subject: Re: HBase table design question
>
> You're generally on the right track. In many cases, rather than using
> secondary indexes in the relational world, you would have multiple tables in
> HBase with different keys.
>
> You may not need a table for each query, but that depends on your
> requirements of performance and the specific details of the data patterns
> (how sparse or dense certain things will be).
>
> I would start with a User table and a WebPage table, keyed by their ids.
>
> The User table could have a Visited family. The WebPage table could have a
> VisitedBy family.
>
> Your queries could be run like this:
>
> 1) Get(table=User, row=userid, family=Visited, qualifier=WebPageID)
> There are a couple different ways you could model the data here. You could
> either put in a new version of the same qualifier for each visit, or you
> could make the qualifier a composite key like WebPageID+VisitStamp, so they
> would then be grouped together.
>
> 2) Get(table=User, row=userid, family=Visited)
> All qualifiers would represent all pages visited.
>
> 3) Get(table=WebPage, row=pageid, family=VisitedBy)
> All qualifiers would represent all users who visited. You could store
> multiple visits by the same user in different ways, as above.
>
> As for using hive to run these queries, that is not something I would
> recommend. For one, hive integration with hbase is not complete (as far as I
> know). Second, hive's emphasis is on batch/offline mapreduce jobs. Running
> the above 3 queries can be done with the HBase API directly, and efficiently.
> There's no need for SQL or anything like it.
>
> Hope that helps.
>
> JG
>
> Something Something wrote:
>> Hello,
>>
>> Trying to figure out what's the recommended way of designing tables under
>> HBase. Let's say I need a table to gather statistics regarding user's
>> visits to different web pages.
>>
>> In the relational database world, we could have a table with following
>> columns:
>>
>> Primary Key (system generated)
>> UserId (foreign key)
>> WebPageId (foreign key)
>> VisitedDateTime & so on....
>>
>> Basically, this table would allow us to answer (amongst many others) the
>> following questions...
>>
>> 1) How many times a User visited a certain Page?
>> 2) Which web pages did a particular user visit?
>> 3) Which users visited a particular web page? etc etc.
>>
>> What's the best way to model this in HTable?
>> Since every HTable is really a distributed hashmap, does that mean I need to
>> create 3 different HTables (HashMaps) to answer these 3 questions?
>>
>> 1) One table with (UserId + WebPageId) as the compound key? (To answer #1)
>> 2) One table with UserId as the key? (To answer #2)
>> 3) One table with WebPageId as the key? (To answer #3)
>>
>> Along with HTable should I use Hive to run queries such as #1 above?
>> Any help in this regard will be greatly appreciated. Thanks.
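
To make the two layouts discussed in the thread concrete, here is a minimal in-memory sketch in Java. It is not the HBase client API: `FamilySketch`, its methods, the `@` separator in the composite qualifier, and the sample row keys and timestamps are all hypothetical names chosen for illustration. It only mimics the row -> qualifier -> timestamp -> value layout, so that the "new version per visit" option and the "WebPageID+VisitStamp composite qualifier" option can be compared side by side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for ONE column family of ONE table.
// This is not the HBase client API; it only mimics the data layout.
final class FamilySketch {
    private final int maxVersions;
    // row key -> qualifier -> timestamp -> value (all sorted ascending)
    private final Map<String, NavigableMap<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    FamilySketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // A put with a new timestamp adds a version instead of replacing the old one.
    void put(String row, String qualifier, long timestamp, String value) {
        NavigableMap<Long, String> versions = rows
                .computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(qualifier, q -> new TreeMap<>());
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry(); // discard the oldest version
        }
    }

    // Value with the newest timestamp, or null if the cell does not exist.
    String getLatest(String row, String qualifier) {
        NavigableMap<Long, String> versions = versionsOf(row, qualifier);
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    // All retained values of a cell, newest first.
    List<String> getVersions(String row, String qualifier) {
        return new ArrayList<>(versionsOf(row, qualifier).descendingMap().values());
    }

    // All qualifiers of a row, in sorted order.
    List<String> qualifiers(String row) {
        return new ArrayList<>(rows.getOrDefault(row, new TreeMap<>()).keySet());
    }

    // Number of qualifiers in a row that start with the given prefix.
    long countWithPrefix(String row, String prefix) {
        return qualifiers(row).stream().filter(q -> q.startsWith(prefix)).count();
    }

    private NavigableMap<Long, String> versionsOf(String row, String qualifier) {
        return rows.getOrDefault(row, new TreeMap<>()).getOrDefault(qualifier, new TreeMap<>());
    }

    public static void main(String[] args) {
        // Option A: same qualifier, one version per put (3 versions retained).
        FamilySketch visited = new FamilySketch(3);
        visited.put("user1", "pageId", 1L, "100");
        visited.put("user1", "pageId", 2L, "200");
        System.out.println(visited.getLatest("user1", "pageId"));   // 200
        System.out.println(visited.getVersions("user1", "pageId")); // [200, 100]

        // Option B: composite qualifier "<pageId>@<visitStamp>", one cell per visit.
        FamilySketch visitedComposite = new FamilySketch(1);
        visitedComposite.put("user1", "page42@1001", 1L, "");
        visitedComposite.put("user1", "page42@1002", 2L, "");
        visitedComposite.put("user1", "page7@1003", 3L, "");
        System.out.println(visitedComposite.countWithPrefix("user1", "page42@")); // 2
        System.out.println(visitedComposite.qualifiers("user1"));
    }
}
```

In this sketch, Option A loses visits once the version limit is reached, while Option B keeps every visit as its own qualifier and lets a prefix count answer "how many times did this user visit this page". The same structure keyed by page id would stand in for the VisitedBy family of the WebPage table.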
