Thanks Matt,
Assuming, therefore, that I run a single tasktracker and have 48 cores available, and following your recommendation of a 2:1 ratio of mappers to reducers, I will be assigning:

mapred.tasktracker.map.tasks.maximum=30
mapred.tasktracker.reduce.tasks.maximum=15

This brings me on to my question:

"Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of task trackers. In my architecture, I potentially have only one, if only a single task tracker can be configured on each host. What should I set these values to, therefore, considering the box spec?"

I have read:

mapred.map.tasks = 10x the number of task trackers
mapred.reduce.tasks = 2x the number of task trackers

Given that I have a single task tracker with multiple concurrent processes, does this equate to:

mapred.map.tasks=300?
mapred.reduce.tasks=30?

Some reasoning behind these values would be appreciated.
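
To make the two possible readings of the 10x / 2x rule of thumb concrete, here is a minimal Python sketch of the arithmetic only. The function name `rule_of_thumb` and its parameters are hypothetical and not part of Hadoop; the factors and the 30 / 15 slot counts are the figures quoted in this email. Which reading is appropriate is exactly the open question, so the sketch takes no position on it.

```python
def rule_of_thumb(map_units, reduce_units, map_factor=10, reduce_factor=2):
    """Scale the quoted 10x / 2x factors by a count of 'units'.

    Hypothetical helper: 'units' is whatever the rule is assumed to
    scale with, either task trackers or task slots.
    """
    return {
        "mapred.map.tasks": map_factor * map_units,
        "mapred.reduce.tasks": reduce_factor * reduce_units,
    }


if __name__ == "__main__":
    # Literal reading: the rule scales with the number of task trackers (1 here).
    print(rule_of_thumb(1, 1))
    # Reading in this email: the rule scales with the 30 map and 15 reduce slots.
    print(rule_of_thumb(30, 15))
```

The second call reproduces the 300 and 30 asked about above; the first shows how small the values become if the rule is read as per task tracker.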


I appreciate this is a little simplified and that we will need to profile; I am just looking for a sensible starting position.
Thanks
Dale


On 15/12/2011 16:43, GOEKE, MATTHEW (AG/1000) wrote:
Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that
machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to
run more than one of any of them, as the tasktracker will spawn multiple child
JVMs, which is where you will get your task parallelism. When you set
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum you set the upper bound on child JVM
creation, but this needs to be configured based on the job profile (I don't
know much about Mahout, but traditionally I set up clusters as 2:1 mappers to
reducers until the profile proves otherwise). If you look at blogs / archives
you will see that you can assign 1 child task per *logical* core (e.g.
hyper-threaded core), and to be safe you will want 1 daemon per *physical*
core, so you can divvy it up based on that recommendation.
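
As a rough illustration of the arithmetic described above, the following Python sketch reserves one core per daemon and splits the remaining cores 2:1 between map and reduce slots. The function name `split_slots`, its parameters, and the rounding choice are assumptions made for this example and are not part of Hadoop. It also assumes the 48 cores mentioned in this thread can each host one child task, ignoring any physical versus logical distinction.

```python
def split_slots(cores, daemons=4, map_to_reduce=2):
    """Return (map_slots, reduce_slots) for a single tasktracker.

    Hypothetical helper: reserves one core per daemon, then divides the
    remaining cores map_to_reduce:1 between map and reduce slots.
    """
    available = cores - daemons
    if available < 2:
        raise ValueError("not enough cores left for both map and reduce slots")
    # Reducers get their share rounded down; mappers take what is left.
    reduce_slots = max(1, available // (map_to_reduce + 1))
    map_slots = available - reduce_slots
    return map_slots, reduce_slots


if __name__ == "__main__":
    map_slots, reduce_slots = split_slots(48)
    print(f"mapred.tasktracker.map.tasks.maximum={map_slots}")
    print(f"mapred.tasktracker.reduce.tasks.maximum={reduce_slots}")
```

For 48 cores and four daemons this yields 30 map slots and 14 reduce slots. Treat it only as a starting point, since the right split depends on the job profile.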

To summarize the above: if you are sharing the same IO pipe / box then there is 
no reason to have multiple daemons running because you are not really gaining 
anything from that level of granularity. Others might disagree based on 
virtualization but in your case I would say save yourself the headache and keep 
it simple.

Matt

-----Original Message-----
From: Dale McDiarmid [mailto:[email protected]]
Sent: Thursday, December 15, 2011 1:50 PM
To: [email protected]
Subject: Large server recommedations

Hi all
I am new to the community and to Hadoop, and am looking for some advice on
optimal configurations for very large servers.  I have a single server
with 48 cores and 512GB of RAM and am looking to perform an LDA analysis
using Mahout across approx 180 million documents.  I have configured my
namenode and job tracker.  My questions are primarily around the optimal
number of tasktrackers and data nodes.  I have had no issues configuring
multiple datanodes, each of which could potentially use its own
disk location (underlying disk is SAN - solid state).

However, from my reading the typical architecture for hadoop is a larger
number of smaller nodes with a single tasktracker on each host.  Could
someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is
this configured as it doesn't seem possible to control the host:port.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker
parameters? The recommendation for these settings seems to relate to
the number of task trackers.  In my architecture, I potentially have
only one, if only a single task tracker can be configured on each host.
What should I set these values to, therefore, considering the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum - do these control the number of
JVM processes spawned to handle the respective steps? Is one tasktracker
with these values set to 48 equivalent to 48 task trackers with these
values set to 1?

4. Benefits of a large number of datanodes on a single large server? I
can see value where the host has multiple IO interfaces and disk sets to
avoid IO contention. In my case, however, a SAN negates this.  Are there
still benefits of multiple datanodes outside of resiliency and potential
increase of data transfer i.e. assuming a single data node is limited
and single threaded?

5. Any other thoughts/recommended settings?

Thanks
Dale