> Can you say a bit more about your processes?  Are they truly parallel maps
> without any shared state?

My map process reads one record (a line of text) from the input and explodes it
into a handful of key-value pairs. The number of key-value pairs is typically
small (around 8 on average) and is independent of the size of the dataset;
it depends only on the input record.
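
Roughly, the map looks like the sketch below (class and field names are
illustrative, not my actual code; it uses the old org.apache.hadoop.mapred API):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Explodes one input record (a line of text) into a small, bounded number of
// key-value pairs; the count depends only on the record itself, typically ~8.
public class ExplodeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");
    // One output pair per field after the record id; no disk or network access.
    for (int i = 1; i < fields.length; i++) {
      output.collect(new Text(fields[0] + ":" + i), new Text(fields[i]));
    }
  }
}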

I use a handful of Reporter objects to monitor the performance of the map/reduce
processes, but I don't think that affects performance. The map does not access
the disk or anything else external; it is purely CPU-bound.
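
The reporter usage amounts to counter increments and status strings, nothing
more. Something along these lines (the counter names are made up for
illustration):

import org.apache.hadoop.mapred.Reporter;

// Monitoring-only bookkeeping; incrementing counters and setting the task
// status is cheap and does not touch the disk.
public class MapStats {
  public enum Counters { RECORDS_SEEN, PAIRS_EMITTED }

  public static void report(Reporter reporter, long records, long pairs) {
    reporter.incrCounter(Counters.RECORDS_SEEN, records);
    reporter.incrCounter(Counters.PAIRS_EMITTED, pairs);
    reporter.setStatus(records + " records, " + pairs + " pairs emitted");
  }
}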

> Are you getting a good limit on maximum number of maps and reduces per
> machine?

The maximum number of maps/reduces per machine is set to two by the
administrator, so I get around 140 concurrent maps on the 70-machine cluster.

> How are you measuring these times?  Do they include shuffle time as well
> as
> map time?  Do they include time before running?

Yes - the time includes both the shuffle time and the map time. The shuffle
time is less than 30 seconds even for the largest tasks, because I can see
that the map has started to process records (although very, very slowly). I
monitor the progress of each map process using the Reporter objects.

> What happens on the large size problems if you decrease the number of
> maps,
> but keep input size constant?

My map processes run out of heap space if I decrease the number of map
processes. With a heap size of only 256 MB, the maximum number of records each
map process can handle is about 250k.
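
If I wanted a single map to handle more than 250k records I would have to
raise the child JVM heap, which as far as I know is a one-line job setting
(property name as in the 0.1x releases; the value here is just an example):

import org.apache.hadoop.mapred.JobConf;

public class HeapSetting {
  // Sketch: give each map/reduce child JVM a larger heap so one map task can
  // hold more than 250k records of in-memory state.
  public static void raiseChildHeap(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}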

> Finally, why do you have so many reduces?  Usually it is good to have at
> most a small multiple of the number of machines in your cluster.

Yes - I agree, and thanks for pointing that out. In the large setup I rarely
even get past the map stage, because it takes tens of hours for all the maps
to finish.
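
Cutting the reduce count down to a small multiple of the slot count would be a
one-line change in the driver, something like this sketch (numbers are
illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ReduceCount {
  // Sketch: with ~70 machines and 2 reduce slots each, roughly 140 reduces is
  // plenty even for very large inputs; only the map count needs to grow with
  // the input size.
  public static void configure(JobConf conf, int machines, int slotsPerMachine) {
    conf.setNumReduceTasks(machines * slotsPerMachine);  // e.g. 70 * 2 = 140
  }
}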



> On 12/26/07 2:52 PM, "jag123" wrote:
> 
>> 
>> Hi,
>> 
>> I am running a map/reduce task on a large cluster (70+ machines). I use a
>> single input file and a sufficient number of map/reduce tasks so that each
>> map process gets 250k records. That is, if my input file contains 1
>> million records, I use 4 map and 4 reduce processes so that each map
>> process gets 250k records. The maps/reduces usually take 30 seconds to
>> complete.
>> 
>> A strange thing happens when I scale this problem:
>> 
>> 1 million records, 4 map + 4 reduce ==> 30 seconds per map process
>> 5 million records, 20 map + 20 reduce ==>  1 minute per map process
>> 50 million records, 200 map + 200 reduce ==>  3 minutes per map process
>> 500 million records, 2000 map + 2000 reduces ==> 45 minutes! per map
>> process
>> 
>> Note that in all the above cases, the map process performs the same
>> amount
>> of work (250k records).
>> 
>> In all the cases, I use a single large input file. Hadoop breaks the file
>> into ~16 MB chunks (about 250k records). Input format is
>> TextInputFormat.class. I cannot think of any reason why this is
>> happening.
>> The task setup in all the above cases takes 30 seconds or so. But then
>> the
>> map process practically crawls. 
> 
> 
> 
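
For reference, the job setup described in my original post boils down to the
driver sketch below (class names are illustrative; it uses the static
FileInputFormat/FileOutputFormat path setters from the old mapred API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class ScalingJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScalingJob.class);
    conf.setJobName("explode-records");

    conf.setInputFormat(TextInputFormat.class);   // one record per line
    conf.setMapperClass(ExplodeMapper.class);     // the illustrative mapper above
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Sized so that each map sees ~250k records, e.g. 4 maps + 4 reduces for a
    // 1M-record input, 2000 + 2000 for 500M records.
    conf.setNumMapTasks(Integer.parseInt(args[2]));
    conf.setNumReduceTasks(Integer.parseInt(args[3]));

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}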

-- 
View this message in context: 
http://www.nabble.com/Performance-issues-with-large-map-reduce-processes-tp14507375p14516774.html
Sent from the Hadoop Users mailing list archive at Nabble.com.
