[ https://issues.apache.org/jira/browse/MAPREDUCE-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866309#action_12866309 ]
Tudor Vlad commented on MAPREDUCE-1781:
---------------------------------------

Thank you. I put the option in the config file and, from the preliminary test, it works as I intended.

The reason I am using only 1 mapper per node at a time is that I am testing the parallelization efficiency of a highly CPU- and memory-bound application over Hadoop, focusing on parallelization via distributed computing rather than multicore. Additionally, I will run more split scenarios (different approximations of a multiple of the number of nodes) on a heterogeneous datacenter in order to determine a good input split size (and also how many nodes might be necessary to keep the scalability efficiency over 0.8). The application's running time is approximately linear in the input size, but performance is poor if the input is too small (I am trying to find the exact cutoff point). In real-life scenarios I will use Hadoop with every resource it has.

However, I have a question here: what if the cluster is highly heterogeneous and contains single-cores, dual-cores, dual processors with dual cores, quads, ... - is it possible to specify, say, 4 mappers per processor, or am I limited to a static value set at the startup of Hadoop?

Regarding the initial problem, I think it would help a lot of people (especially new users) if the configuration page [ http://hadoop.apache.org/common/docs/current/mapred-default.html ] specified which parameters are set at startup and which at job runtime.
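For reference, the change that worked for me is the following entry, placed in conf/mapred-site.xml on each slave node (a minimal sketch of my setup; as I understand it, this property is read by the TaskTracker when its daemon starts, not per job - which would also explain why passing it with -D on the streaming command line had no effect):

    <configuration>
      <property>
        <!-- Upper bound on map tasks running simultaneously on this
             TaskTracker. Set to 1 here because I want a single mapper
             per node for my efficiency tests. -->
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>1</value>
      </property>
    </configuration>

If each node reads its own copy of this file, then on a heterogeneous cluster I could presumably give a quad-core node a value of 4 and a single-core node a value of 1 (restarting the TaskTrackers afterwards) - but I would appreciate confirmation that this is the intended way to do it.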
> option "-D mapred.tasktracker.map.tasks.maximum=1" does not work when no of
> mappers is bigger than no of nodes - always spawns 2 mappers/node
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1781
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1781
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.20.2
>         Environment: Debian Lenny x64, and Hadoop 0.20.2, 2GB RAM
>            Reporter: Tudor Vlad
>
> Hello
> I am a new user of Hadoop and I have some trouble using Hadoop Streaming and
> the "-D mapred.tasktracker.map.tasks.maximum" option.
> I'm experimenting with an unmanaged application (C++) which I want to run
> over several nodes in 2 scenarios:
> 1) the number of maps (input splits) is equal to the number of nodes
> 2) the number of maps is a multiple of the number of nodes (5, 10, 20, ...)
> Initially, when running the tests in scenario 1, I would sometimes get 2
> processes/node on half the nodes. However, I fixed this by adding the option
> "-D mapred.tasktracker.map.tasks.maximum=1", so everything works fine.
> In the case of scenario 2 (more maps than nodes) this directive no longer
> works; I always get 2 processes/node. I even tested with maximum=5 and I
> still get 2 processes/node.
> The entire command I use is:
> /usr/bin/time --format="-duration:\t%e |\t-MFaults:\t%F |\t-ContxtSwitch:\t%w" \
> /opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
> -D mapred.tasktracker.map.tasks.maximum=1 \
> -D mapred.map.tasks=30 \
> -D mapred.reduce.tasks=0 \
> -D io.file.buffer.size=5242880 \
> -libjars "/opt/hadoop/contrib/streaming/hadoop-7debug.jar" \
> -input input/test \
> -output out1 \
> -mapper "/opt/jobdata/script_1k" \
> -inputformat "me.MyInputFormat"
> Why is this happening and how can I make it work properly (i.e. be able to
> limit exactly how many mappers I can have at one time per node)?
> Thank you in advance

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.