This suggests to me that the problem is still an insufficient number of 
parallel tasks.  Now that you have bumped up the number of tasks allowed to run 
at once, you need to bump up the number that are actually created.

From the documentation, "the number of maps is usually driven by the total size 
of the inputs, that is, the total number of blocks of the input files". 
Typically this makes sense because the tasks run locally on the machines that 
actually store the input data, so you are normally going to run out of CPU 
before you run out of disk bandwidth. In your case you are running very 
powerful hardware, so it makes sense to increase parallelism further.  You can 
read about how to increase the number of maps and reduces here: 
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html#Mapper (I 
suggest reading this whole page anyway). 
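
For example, with the old org.apache.hadoop.mapred API that ships with 0.20.2, 
you can hint at a higher map count and set the reduce count explicitly in the 
job driver.  This is only a sketch, not your actual job: the class name, the 
paths and the counts 64 and 14 are placeholders to tune for your cluster, and 
setNumMapTasks() is only a hint, since the real number of maps follows the 
input splits.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ParallelJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ParallelJobDriver.class);
        conf.setJobName("parallel-job");   // placeholder job name

        // Only a hint: the actual number of maps is driven by the input
        // splits, so more (or smaller) input blocks is what really raises it.
        conf.setNumMapTasks(64);

        // Honoured exactly; the tutorial suggests starting around
        // 0.95 * (nodes * mapred.tasktracker.reduce.tasks.maximum).
        conf.setNumReduceTasks(14);

        // Set your Mapper/Reducer classes and formats here as usual.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}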

Also, you should be aware that you can use the web UI to monitor how many tasks 
are created for each job and where they run.  Open a browser and go to 
http://<your master server>:50030/
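
If you want to confirm that the slot settings you edited are actually being 
picked up on the workers, you can also read them back from the loaded 
configuration.  A rough sketch (the class name is made up, and the defaults of 
2 match what 0.20.2 ships with); run it with bin/hadoop so the conf directory 
is on the classpath:

import org.apache.hadoop.conf.Configuration;

public class SlotCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Plain Configuration does not load mapred-site.xml by default,
        // so pull it in explicitly from the classpath.
        conf.addResource("mapred-site.xml");

        System.out.println("map slots per tasktracker:    "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        System.out.println("reduce slots per tasktracker: "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
    }
}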


----- Original Message -----
From: Ratner, Alan S (IS) <[email protected]>
To: [email protected] <[email protected]>
Sent: Wed Sep 15 13:38:22 2010
Subject: Re: Making optimum use of cores

Thanks for the quick responses.  I raised the 2 parameters to 14 (figuring 
there might be other apps running - like ZooKeeper - that might want some cores 
of their own).  This has made a qualitative difference - the System Monitor now 
shows much higher squiggly lines, indicating better distribution of the job 
across the various cores.  However, the quantitative difference is 
insignificant - my job runs only about 4% faster.  I hope I don't have to 
migrate from the Java API to the C++ API.

Alan


-----Original Message-----
From: Mohamed Riadh Trad [mailto:[email protected]] 
Sent: Wednesday, September 15, 2010 10:24 AM
To: [email protected]; [email protected]
Subject: EXTERNAL:Re: Making optimum use of cores

Hi Christopher,

I've been working at Sungard (Global Trading); I left 2 years ago... Hope you 
enjoy working there...

When it comes to performance, you should use the C++ API instead.  By fixing 
the map slots per node to the number of virtual CPUs per node, you can fully 
parallelize jobs... and use 1600% of the Nehalem CPU.

Regards,


On 15 Sept 2010, at 16:00, <[email protected]> 
<[email protected]> wrote:

> It seems likely that you are only running one (single-threaded) map or reduce 
> operation per worker node. Do you know whether you are in fact running 
> multiple operations?
> 
> This also sounds like it may be a manifestation of a question that I have 
> seen a lot on the mailing lists lately, which is that people do not know how 
> to increase the number of task slots in their tasktracker configuration.  
> This is normally controlled via the setting 
> mapred.tasktracker.{map|reduce}.tasks.maximum in mapred-site.xml.  The 
> default of 2 each is probably too low for your servers.
> 
> 
> ----- Original Message -----
> From: Ratner, Alan S (IS) <[email protected]>
> To: [email protected] <[email protected]>
> Sent: Wed Sep 15 09:47:47 2010
> Subject: Making optimum use of cores
> 
> I'm running Hadoop 0.20.2 on a cluster of servers running Ubuntu 10.04.
> Each server has 2 quad-core Nehalem CPUs for a total of 8 physical cores
> running as 16 virtual cores.  Ubuntu's System Monitor displays 16
> squiggly lines showing usage of the 16 virtual cores.  We only seem to
> be making use of one of the 16 virtual cores on any slave node and even
> on the master node only one virtual core is significantly busy at a
> time.  Is there a way to make better use of the cores?  Presumably I
> could run Hadoop in a VM assigned to each virtual core but I would think
> there must be a more elegant solution.
> 
> Alan Ratner
> 
> 

