Praveen,

We use CDH3, so the link I refer to is 
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html. The reason it 
is defaulting to 2 per node is not that it looks at the number of cores, but 
that mapred.tasktracker.map.tasks.maximum is set to 2 by default. There is a 
wealth of information in each of the default configuration files, and it may 
take playing with some of the settings before you are happy with the result. 
Remember that you can easily overfit your parameters to a specific data set and 
then see horrible performance on others, so be careful when making selective 
changes. That being said, here are some possible ways to get it to work for you:

        - Raise your block size to something higher than your input file size 
and then reimport the files into HDFS
        - Change mapred.jobtracker.maxtasks.per.job to 10
        - Set up a custom input format / record reader for that data so that 
the files are not split (see the sketch after this list)
        - The list goes on...
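
As a rough illustration of the custom input format option, here is a minimal 
sketch against the old 0.20 mapred API (the class name is mine and purely 
illustrative, not something shipped with Hadoop): an input format that refuses 
to split its files, so each input file gets exactly one map task.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical example class, not part of Hadoop itself.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            // Never split: each file becomes exactly one input split,
            // and therefore exactly one map task.
            return false;
        }
    }

You would then register it on your job with something like 
conf.setInputFormat(NonSplittableTextInputFormat.class).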

If you are suffering a significant performance setback from Hadoop 
auto-splitting your files, then there is possibly another culprit behind the 
scenes, but the above should give you some options to start with.

HTH,
Matt

-----Original Message-----
From: praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com] 
Sent: Monday, June 20, 2011 3:13 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: controlling no. of mapper tasks

Hi David,
Thanks for the response. I didn't specify anything for the no. of concurrent 
mappers, but I do see that it shows as 10 on port 50030 (for a 5-node cluster). 
So I believe Hadoop is defaulting to the no. of cores in the cluster, which is 
10. That is why I also want to set the no. of map tasks to the no. of cores, so 
that it matches the max concurrent map tasks.

Praveen

-----Original Message-----
From: ext GOEKE, MATTHEW (AG/1000) [mailto:matthew.go...@monsanto.com] 
Sent: Monday, June 20, 2011 3:49 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: controlling no. of mapper tasks

Praveen,

David is correct, but we might need to use different terminology. Hadoop looks 
at the number of input splits, and if a file is not splittable then yes, it 
will only use 1 mapper for it. In the case of most files (which are splittable) 
Hadoop will break them into multiple map tasks and work over each one. What you 
need to take a look at is the number of concurrent mappers / reducers that you 
have defined per node, so that you do not cause context switches due to too 
many processes per core. Take a look in mapred-site.xml and you will see a 
default defined (if not, take a look at the default mapred-site.xml for your 
version).
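
For reference, a minimal mapred-site.xml snippet for those per-node limits 
looks like the following; the values shown are just the stock 0.20 defaults, 
not a recommendation, and a common rule of thumb is to size them to the cores 
per node while leaving headroom for the DataNode and TaskTracker daemons.

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
      <description>Max map tasks run concurrently per TaskTracker.</description>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
      <description>Max reduce tasks run concurrently per TaskTracker.</description>
    </property>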

Matt

-----Original Message-----
From: praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com]
Sent: Monday, June 20, 2011 2:44 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: controlling no. of mapper tasks

Hi David,
I think Hadoop is looking at the data size, not the no. of input files. If I 
pass in .gz files, then yes, Hadoop chooses 1 map task per file, but if I pass 
in a HUGE text file, or the same file split into 10 files, it chooses the same 
no. of map tasks (191 in my case).

Thanks
Praveen

-----Original Message-----
From: ext David Rosenstrauch [mailto:dar...@darose.net]
Sent: Monday, June 20, 2011 3:39 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: controlling no. of mapper tasks

On 06/20/2011 03:24 PM, praveen.pe...@nokia.com wrote:
> Hi there, I know the client can set "mapred.reduce.tasks" to specify the no.
> of reduce tasks and Hadoop honours it, but "mapred.map.tasks" is not
> honoured by Hadoop. Is there any way to control the number of map tasks?
> What I noticed is that Hadoop is choosing too many mappers and there
> is extra overhead being added due to this. For example, when I have
> only 10 map tasks, my job finishes faster than when Hadoop chooses 191
> map tasks. I have a 5-slave cluster and 10 tasks can run in parallel. I
> want to set both map and reduce tasks to 10 for max efficiency.
>
> Thanks Praveen

The number of map tasks is determined dynamically based on the number of input 
chunks you have. If you want fewer map tasks, either pass fewer input files to 
your job, or store the files using larger chunk sizes (which will result in 
fewer chunks per file, and thus fewer chunks total).
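
If re-importing the data at a larger block size is not practical, one 
alternative (a hedged sketch against the old 0.20 mapred API, not something 
prescribed in this thread) is to raise the minimum split size on the job so 
that FileInputFormat produces fewer, larger splits and therefore fewer map 
tasks:

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver; the class name and value are illustrative only.
    public class FewerSplitsExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Ask for splits of at least 256 MB, which yields fewer
            // (larger) splits and therefore fewer map tasks.
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
            // ... set mapper/reducer classes, input/output paths, and
            // submit the job as usual ...
        }
    }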

HTH,

DR