Hello, I recently posted about some issues with "big" data insertion. I
would like to thank everyone who gave such interesting answers.
I would like to clarify one point in answer to this question:
> What exactly does your data look like / what are you trying to index?
> IndexedTable is NOT known to be very performant. If speed is of utmost
> concern, I would recommend you manage secondary indexing yourself, or
> look for other solutions like denormalization. If you help us
> understand what that query is actually doing we might be able to help
> you optimize your schema for it.
Here is a simplified view of my "data schema":
Table URLS = {url:String (for instance www.myurl.com), groupid:String (for
instance 126378), others_url_info:String}
Table GROUPS = {groupid:String, other_group_info:String}
The relationship between GROUP and URL is one-to-many (1:N), i.e.
- one URL only belongs to one GROUP
- one GROUP contains many URLS (from 1000 to 1M)
URLS contains 1G (about one billion) rows
GROUPS contains 1M rows
My goal is to be able to retrieve all the URLS belonging to one GROUP.
Logically, I'm tempted to use an index on groupid in the URLS HTable. Is that
the right approach?
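
To make the "manage secondary indexing yourself" suggestion from the quoted
reply concrete, here is a minimal sketch in plain Python. It does not call
HBase at all: a sorted list stands in for a second table whose row keys are
sorted lexicographically. The separator, the helper names (index_key,
urls_in_group) and the sample rows are assumptions made up for illustration.
The idea is that a composite row key of the form groupid + separator + url
lets a single prefix scan return every URL of a group.

```python
import bisect

# Assumed separator: a character that never appears in a groupid, so that
# the prefix for group "1263" cannot match keys of group "126378".
SEP = "\x00"

def index_key(groupid, url):
    """Composite row key for a hypothetical index table."""
    return groupid + SEP + url

def urls_in_group(sorted_keys, groupid):
    """Emulate a prefix scan: walk the sorted keys starting at the prefix."""
    prefix = groupid + SEP
    pos = bisect.bisect_left(sorted_keys, prefix)
    urls = []
    while pos < len(sorted_keys) and sorted_keys[pos].startswith(prefix):
        urls.append(sorted_keys[pos][len(prefix):])
        pos += 1
    return urls

# Sample (url, groupid) rows; the values are illustrative only.
rows = [
    ("www.myurl.com", "126378"),
    ("www.hadoop.org", "126378"),
    ("www.example.com", "1263"),
]

# The sorted list plays the role of the index table's row keys.
index_table = sorted(index_key(groupid, url) for url, groupid in rows)

print(urls_in_group(index_table, "126378"))
# ['www.hadoop.org', 'www.myurl.com']
```

The cost of this design is one extra write per URL (one row in URLS, one in
the index table), in exchange for reading a group with one contiguous scan.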
Another point I would like to discuss is how to balance regions.
Since my row keys are URLs, a lot of them start with 'w' (for
instance www.hadoop.org). As a result, during data insertion, one
regionserver is overloaded with all the URLs starting with 'w'. Is there a way
to "prepare" the regions in advance to avoid this problem?
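
One common way to approach this kind of hotspot is to prefix each row key with
a short hash-derived "salt" and to create the regions in advance with split
points that match the salt values. The sketch below is plain Python and only
illustrates the key layout. The bucket count, the use of MD5, the "|"
delimiter and the helper names (salted_key, split_points, region_for) are all
assumptions, and whether a table can be created with predefined split keys
depends on the HBase version in use.

```python
import bisect
import hashlib

NUM_BUCKETS = 16  # assumed number of regions to prepare in advance

def salted_key(url):
    """Prefix the URL with a two-hex-digit bucket derived from its MD5 hash."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return "%02x|%s" % (bucket, url)

def split_points(num_buckets=NUM_BUCKETS):
    """Region boundaries: one split key per bucket except the first."""
    return ["%02x" % b for b in range(1, num_buckets)]

def region_for(key, splits):
    """Index of the region whose key range contains the given row key."""
    return bisect.bisect_right(splits, key)

splits = split_points()
urls = ["www.site%d.com" % i for i in range(1000)]

# Count how many keys land in each region; every URL starts with 'w',
# but the salt decides the region, not the first letter of the URL.
counts = [0] * NUM_BUCKETS
for url in urls:
    counts[region_for(salted_key(url), splits)] += 1
print(counts)
```

The trade-off is that rows are no longer sorted by URL, so a range scan over
URLs has to visit every bucket. A lookup of one exact URL still works, because
the salt can be recomputed from the URL itself.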
Guillaume