Guillaume,

Thanks for providing more detail.

So, as I understand it, you are already storing the URL -> Group relationship (each URL maps to exactly one group), but you also need to store the Group -> URLs relationship (1:N).

My solution would be to have a "urls" family in your GROUPS table. And for each URL within a group, you would have a column in that family. The optimal read query would be "give me all urls for this group".

What you would actually store in that column/value is up to you, but I suppose you'd have to store the full URL in the column qualifier since that's the only unique key.
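
To make the layout concrete, here is a rough sketch in plain Python. It is only a toy in-memory model of one GROUPS row, not HBase client code; the helper names (add_url, urls_for_group) and the empty placeholder value are made up for illustration.

```python
# Toy in-memory model of the proposed GROUPS layout (not HBase client code).
# Assumed layout: row key = group id, family "urls", one column per URL,
# with the full URL as the column qualifier and a small placeholder value.
groups = {}  # row key -> family name -> {qualifier: value}


def add_url(group_id, url, value=b""):
    """Store a URL as a column qualifier in the group's "urls" family."""
    row = groups.setdefault(group_id, {"urls": {}})
    row["urls"][url] = value


def urls_for_group(group_id):
    """'Give me all urls for this group': a single-row read, qualifiers sorted."""
    row = groups.get(group_id, {"urls": {}})
    return sorted(row["urls"])


add_url("126378", "www.myurl.com")
add_url("126378", "www.hadoop.org")
print(urls_for_group("126378"))
```

Because the URL is the qualifier, adding the same URL twice just overwrites the same column, which is why it works as the unique key.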

You just need to test how fetching your upper bound of 1M columns in a single row performs with your data and cluster. HBase 0.20 does handle that many columns in a single row/family, but there is currently no (non-manual) way to break a row read up into multiple requests, so I have seen users hit OutOfMemoryError (OOME) issues when the row is too large.

The real solution is intra-row scanning (https://issues.apache.org/jira/browse/HBASE-1537), currently targeted for 0.21.

A potential short-term solution is to use LIMIT/OFFSET filters to get columns in batches (grab 1 through 200k, 200k through 400k, etc.). However, I'd first see whether it just works; in my local testing it often does if you give things enough memory.
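
The batching loop would look roughly like the sketch below, again in plain Python rather than against the HBase client. fetch_columns is a hypothetical stand-in for one limit/offset request, and the row is modelled as an already-sorted list of qualifiers.

```python
def fetch_columns(row, offset, limit):
    """Hypothetical stand-in for one request: `limit` columns starting at `offset`."""
    return row[offset:offset + limit]


def read_row_in_batches(row, batch_size):
    """Yield a wide row's columns batch by batch instead of in one huge read."""
    offset = 0
    while True:
        batch = fetch_columns(row, offset, batch_size)
        if not batch:
            break  # ran past the last column
        yield batch
        offset += len(batch)


# A small row standing in for a group with many URL columns.
row = ["www.site%04d.com" % i for i in range(10)]
for batch in read_row_in_batches(row, 4):
    print(len(batch), batch[0], "...", batch[-1])
```

Only one batch is held in memory at a time, which is the point of the workaround; the cost is one extra round trip per batch.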


And on your last question: the BigTable paper describes how they reverse domains to prevent this kind of problem and to get better locality by TLD.

(i.e., org.apache.hadoop and com.google.www as row keys)
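
A minimal sketch of that key transformation in Python follows. The function name is made up, it handles bare hostnames only (no scheme, port, or path), and the lowercasing is my own assumption.

```python
def reverse_domain(host):
    """Reverse the dot-separated labels of a hostname to build a row key."""
    # Lowercasing is an assumption here, so keys differing only by case collide.
    return ".".join(reversed(host.lower().split(".")))


hosts = ["www.hadoop.org", "www.google.com", "hadoop.apache.org"]
for key in sorted(reverse_domain(h) for h in hosts):
    print(key)
```

With the reversed form, keys no longer all start with "www", so inserts are spread by TLD and domain rather than piling onto the region that holds the 'w' range.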


Hope that helps.

JG

[email protected] wrote:
Hello, I recently posted concerning some issues about "big" data insertion. I 
would like to thank all the people who gave very interesting answers.
I would like to clarify one point in answer to this question:

What exactly does your data look like / what are you trying to index?
IndexedTable is NOT known to be very performant.  If speed is of utmost
concern, I would recommend you manage secondary indexing yourself, or
look for other solutions like denormalization.  If you help us
understand what that query is actually doing we might be able to help
you optimize your schema for it.        

Here is a simplified view of my "data schema":

Table URLS = {url:String (for instance www.myurl.com), groupid:String (for 
instance 126378), others_url_info:String}
Table GROUPS = {groupid:String, other_group_info:String}

Knowing that the relationship between URL and GROUP is 1:N, i.e.:
- one URL belongs to only one GROUP
- one GROUP contains many URLS (from 1000 to 1M)

URLS contain 1G rows
GROUPS contain 1M rows

My concern is to be able to look at all the URLS belonging to one GROUP. 
Logically, I'm tempted to use an index on groupid in the URLS htable ?

Another point I would like to discuss is how to balance regions. As I'm 
working with urls, I have a lot of urls starting with 'w' (for instance 
www.hadoop.org). As a result, during data insertion, one regionserver is 
overloaded with all the urls starting with 'w'. Is there a way to "prepare" the 
regions to avoid this problem?


Guillaume



