Re: HBase table design question

Jean-Daniel Cryans Tue, 27 Oct 2009 11:58:22 -0700

Yeah, it may feel awkward to repeat the same information in your second
solution, but that's usually how it's done. You could even drop the
"pageid" part of the qualifier and name the family "pageid" instead, so
that "pageid:100" returns you... 100. But then you could denormalize
some more: say there's only one value you really need when doing that
join, such as the page title. Then "pageid:100" returns "Some webpage
title" and may even save you from doing another Get. You would probably
duplicate a lot of data, but if that family is compressed it shouldn't
have a big impact.
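
To make the layout concrete, here is a rough sketch in plain Java. It is
not the HBase client API, just an in-memory sorted map standing in for one
table, so the class name VisitedSketch, the row key "user42" and the second
page title are all made up for illustration. It only shows the idea of one
qualifier per page id in a "pageid" family, with the page title stored as
the value.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for one HBase table: row key -> "family:qualifier" -> value.
// This is NOT the HBase client API; it only mimics the sorted-map layout.
class VisitedSketch {
    // row -> (column -> value); TreeMap keeps the columns of a row sorted.
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Store one cell. Writing the same column again replaces the visible value
    // in this sketch (real HBase would keep older versions by timestamp).
    void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    // Return every qualifier -> value pair of one family in one row.
    Map<String, String> getFamily(String row, String family) {
        Map<String, String> out = new TreeMap<>();
        TreeMap<String, String> cols = rows.get(row);
        if (cols == null) {
            return out; // unknown row: nothing to return
        }
        String prefix = family + ":";
        for (Map.Entry<String, String> e : cols.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                out.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        VisitedSketch users = new VisitedSketch();
        // One qualifier per visited page; the value is the denormalized title.
        users.put("user42", "pageid", "100", "Some webpage title");
        users.put("user42", "pageid", "200", "Another page title"); // hypothetical
        // A single family read lists the visited pages with their titles,
        // so no second lookup in the WebPage table is needed for the title.
        System.out.println(users.getFamily("user42", "pageid"));
    }
}
```

With this layout a single read of the "pageid" family for a user gives back
both the ids and the titles, which is the "save you another Get" part.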


J-D

On Tue, Oct 27, 2009 at 11:48 AM, Something Something
<[email protected]> wrote:
> Thanks, Jean-Daniel, for the reply.  Greatly appreciate it.
>
> So is this the recommended way of implementing a Parent-Child relationship in 
> HBase?  Like... a User visits zero to many WebPages, or say... a Customer 
> buys 1 to many Items.  In such cases, would we create a "Customer" HTable 
> with a "buys" family and keep adding "ItemIds" for every "CustomerId"?  
> Sounds a bit awkward for some reason.. but if that's the recommended way then 
> that's how I will implement it.  Please let me know what the best way to 
> implement Parent-Child relationships in HBase is.
>
> Thanks.
>
>
>
>
> ________________________________
> From: Jean-Daniel Cryans <[email protected]>
> To: [email protected]
> Sent: Tue, October 27, 2009 11:06:04 AM
> Subject: Re: HBase table design question
>
> I think your question was just forgotten.
>
> So your value will not be overwritten; the two puts are simply stored
> as 2 versions of the same cell with different timestamps, and only the
> latest one will be retrieved if you do not specify a timestamp on your
> Get. By default 3 versions of that cell will be kept, but you can change
> this with the family attributes.
>
> J-D
>
> On Tue, Oct 27, 2009 at 10:17 AM, Something Something
> <[email protected]> wrote:
>> No responses to this question :(  Is my question that stupid, I wonder!
>>
>>
>>
>>
>> ________________________________
>> From: Something Something <[email protected]>
>> To: [email protected]
>> Sent: Wed, October 21, 2009 12:16:19 PM
>> Subject: Re: HBase table design question
>>
>> Thanks, Jonathan for the reply.  One quick question...
>>
>> So in the User table when I perform the put operation:
>>
>> .put("visited", "pageId", 100);
>>
>> .put("visited", "pageId", 200);
>>
>> The 100 gets overwritten with 200.  Correct?  So should I use... something 
>> like this...
>>
>> .put("visited", "pageId100", 100);
>> .put("visited", "pageId200", 200);
>>
>> I guess, I am still missing something... sorry.. Please explain.  Thanks.
>>
>>
>>
>>
>> ________________________________
>> From: Jonathan Gray <[email protected]>
>> To: [email protected]
>> Sent: Wed, October 21, 2009 10:25:52 AM
>> Subject: Re: HBase table design question
>>
>> You're generally on the right track.  In many cases, rather than using 
>> secondary indexes as you would in the relational world, you would have 
>> multiple tables in HBase with different keys.
>>
>> You may not need a table for each query, but that depends on your 
>> requirements of performance and the specific details of the data patterns 
>> (how sparse or dense certain things will be).
>>
>> I would start with a User table and a WebPage table, keyed by their ids.
>>
>> The User table could have a Visited family.  The WebPage table could have a 
>> VisitedBy family.
>>
>> Your queries could be run like this:
>>
>> 1) Get(table=User, row=userid, family=Visited, qualifier=WebPageID)
>>   There are a couple different ways you could model the data here. You could 
>> either put in a new version of the same qualifier for each visit, or you 
>> could make the qualifier a composite key like WebPageID+VisitStamp, so they 
>> would then be grouped together.
>>
>> 2) Get(table=User, row=userid, family=Visited)
>>   All qualifiers would represent all pages visited.
>>
>> 3) Get(table=WebPage, row=pageid, family=VisitedBy)
>>   All qualifiers would represent all users who visited.  You could store 
>> multiple visits by the same user in different ways, as above.
>>
>>
>> As for using hive to run these queries, that is not something I would 
>> recommend.  For one, hive integration with hbase is not complete (as far as 
>> I know).  Second, hive's emphasis is on batch/offline mapreduce jobs.   
>> Running the above 3 queries can be done with the HBase API directly, and 
>> efficiently.  There's no need for SQL or anything like it.
>>
>> Hope that helps.
>>
>> JG
>>
>> Something Something wrote:
>>> Hello,
>>>
>>> Trying to figure out what's the recommended way of designing tables under 
>>> HBase.  Let's say I need a table to gather statistics regarding users' 
>>> visits to different web pages.
>>>
>>> In the relational database world, we could have a table with the following 
>>> columns:
>>>
>>> Primary Key (system generated)
>>> UserId (foreign key)
>>> WebPageId (foreign key)
>>> VisitedDateTime & so on....
>>>
>>> Basically, this table would allow us to answer (amongst many others) the 
>>> following questions...
>>>
>>> 1)  How many times a User visited a certain Page?
>>> 2)  Which web pages did a particular user visit?
>>> 3)  Which users visited a particular web page?  etc etc.
>>>
>>> What's the best way to model this in HTable?
>>> Since every HTable is really a distributed hashmap, does that mean I need 
>>> to create 3 different HTables (HashMaps) to answer these 3 questions?
>>>
>>> 1) One table with (UserId + WebPageId) as the compound key? (To answer #1)
>>> 2) One table with UserId as the key? (To answer #2)
>>> 3) One table with WebPageId as the key? (To answer #3)
>>>
>>> Along with HTable should I use Hive to run queries such as #1 above?
>>> Any help in this regard will be greatly appreciated.  Thanks.
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
