Hello, I recently posted about some issues with "big" data insertion. I 
would like to thank everyone who gave such interesting answers.
I would like to clarify one point in answer to this question.

> What exactly does your data look like / what are you trying to index?
> IndexedTable is NOT known to be very performant.  If speed is of utmost
> concern, I would recommend you manage secondary indexing yourself, or
> look for other solutions like denormalization.  If you help us
> understand what that query is actually doing we might be able to help
> you optimize your schema for it.      

Here is a simplified view of my "data schema":

Table URLS = {url:String (for instance www.myurl.com), groupid:String (for 
instance 126378), others_url_info:String}
Table GROUPS = {groupid:String, other_group_info:String}

The relationship between URL and GROUP is 1:N, i.e.:
- one URL belongs to exactly one GROUP
- one GROUP contains many URLS (from 1000 to 1M)

URLS contains 1G rows
GROUPS contains 1M rows

My goal is to be able to retrieve all the URLS belonging to one GROUP. 
Logically, I'm tempted to use an index on groupid in the URLS htable. Is that 
the right approach?
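
To make the question concrete, here is a minimal Python sketch of the 
alternative I understand is being suggested (managing the secondary index 
myself). It assumes a second table whose row key is groupid + separator + url, 
so that all the URLS of one GROUP are contiguous in sort order and can be read 
with a single prefix scan. The sorted list only simulates ordered row keys; 
the names SEP, index_key, add_url and urls_in_group are hypothetical and are 
not part of any HBase API.

```python
import bisect

SEP = "\x00"  # assumption: this separator never appears in a groupid or url


def index_key(groupid, url):
    """Build the composite row key for the hypothetical index table."""
    return groupid + SEP + url


def add_url(index, groupid, url):
    """Insert the composite key while keeping the list sorted, like ordered row keys."""
    key = index_key(groupid, url)
    pos = bisect.bisect_left(index, key)
    if pos == len(index) or index[pos] != key:
        index.insert(pos, key)


def urls_in_group(index, groupid):
    """Prefix scan: seek to the first key of the group, stop at the first foreign key."""
    prefix = groupid + SEP
    urls = []
    for i in range(bisect.bisect_left(index, prefix), len(index)):
        if not index[i].startswith(prefix):
            break
        urls.append(index[i][len(prefix):])
    return urls


index = []
add_url(index, "126378", "www.myurl.com")
add_url(index, "126378", "www.hadoop.org")
add_url(index, "1263780", "www.other.com")
print(urls_in_group(index, "126378"))
```

The separator matters: without it, a scan for group 126378 would also return 
the rows of group 1263780. The cost of this design is that every insertion 
into URLS needs a second write to the index table.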

Another point I would like to discuss is how to balance regions. 
Since I'm working with urls, a lot of them start with 'w' (for 
instance www.hadoop.org). As a result, during data insertion one 
regionserver is overloaded with all the urls starting with 'w'. Is there a way 
to "prepare" the regions to avoid this problem?
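
To illustrate what I have in mind, here is a minimal Python sketch under two 
assumptions of mine: that the table can be created with predefined region 
boundaries, and that a hash-derived prefix on the row key is acceptable. 
NUM_BUCKETS, salted_key and split_points are hypothetical names, and the 
"xx|" key layout is only an example.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: number of regions to prepare (at most 256 here)


def salted_key(url, num_buckets=NUM_BUCKETS):
    """Prefix the url with a bucket derived from its hash, so that urls
    sharing a textual prefix such as 'www.' spread over all buckets."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets  # first hash byte picks the bucket
    return "%02x|%s" % (bucket, url)


def split_points(num_buckets=NUM_BUCKETS):
    """Boundaries between consecutive buckets: N buckets need N - 1 split keys."""
    return ["%02x|" % b for b in range(1, num_buckets)]


urls = ["www.hadoop.org", "www.myurl.com", "www.apache.org", "www.example.com"]
keys = [salted_key(u) for u in urls]
for key in sorted(keys):
    print(key)
print(split_points())
```

The drawback I can see is that the keys are no longer sorted by url, so a 
range scan over urls would have to query every bucket. A lookup of a single 
url still works, since the prefix can be recomputed from the url itself.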


Guillaume

