What Amandeep said.
Also, one clarification for you. You mentioned the reduce task moving
map output across regionservers. Remember, HBase is just a MapReduce
input source or output sink. The sort/shuffle/reduce phase is part of
Hadoop MapReduce and has nothing to do with HBase directly. It uses
the JobTracker/TaskTrackers, not the RegionServers.
Like AK said, you can increase the number of reducers, or reduce the
amount of data you output from the maps.
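To make the first option concrete, here is a rough standalone sketch of why more reducers means less data per reducer. It assumes the job uses Hadoop's default hash partitioning (key hash modulo the number of reduce tasks). The class name, the helper methods, and the "row-N" keys are made up for illustration. In a real job you would set the count with setNumReduceTasks on the job configuration.

```java
public class ReducerLoadSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // non-negative hash of the key, modulo the number of reduce tasks.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many of the given map output keys land on each reducer.
    static int[] recordsPerReducer(String[] keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[partition(key, numReducers)]++;
        }
        return counts;
    }

    // Largest per-reducer count, i.e. the load on the busiest reducer.
    static int busiest(int[] counts) {
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical map output keys, one record per key.
        String[] keys = new String[10000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "row-" + i;
        }
        for (int n : new int[] {1, 5, 20}) {
            int max = busiest(recordsPerReducer(keys, n));
            System.out.println(n + " reducers: busiest reducer receives " + max + " records");
        }
    }
}
```

This only models record counts. Actual shuffle time also depends on record sizes, key skew, and how many reduce slots the cluster has.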
JG
Amandeep Khurana wrote:
On Thu, Aug 20, 2009 at 9:42 AM, john smith wrote:
Hi all,
I have one small doubt. Kindly answer it even if it sounds silly.
No questions are silly. Don't worry.
I am using MapReduce with HBase in distributed mode. I have a table which
spans across 5 region servers. I am using TableInputFormat to read the
data from the table in the map. When I run the program, by default how many
map tasks are created? Is it one per region server or more?
If you set the number of map tasks to a high number, the job automatically
spawns one map task for each region (not region server). Otherwise, it'll
spawn the number you have explicitly specified in the job.
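Put as a tiny sketch, assuming the behaviour is exactly as described above: the map task count is capped by the number of regions. The class, the mapTasks helper, and the figure of 12 regions are hypothetical, not part of the HBase API.

```java
public class MapTaskCountSketch {
    // Hypothetical helper: the requested number of map tasks,
    // capped at one map task per region of the table.
    static int mapTasks(int requestedMapTasks, int regionCount) {
        return Math.min(requestedMapTasks, regionCount);
    }

    public static void main(String[] args) {
        int regions = 12; // assumed: 12 regions spread over the 5 region servers
        System.out.println(mapTasks(100, regions)); // high request: one task per region, prints 12
        System.out.println(mapTasks(4, regions));   // low request: the explicit number, prints 4
    }
}
```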
Also, after the map task is over, the reduce task is taking a bit more time.
Is it due to moving the map output across the regionservers, i.e., moving the
values of the same key to a particular reducer before it can start? Is
there any way I can optimize the code (e.g. by storing the data for the same
reducer nearby)?
Increase the number of reducers. Each reducer will have less data to move.
Thanks :)