Slow start is an important parameter and definitely impacts job runtime. My experience has been that setting it too low or too high can cause problems with job latency. If you always run the same job, it's easy to pick the right value, but if your cluster is multi-tenant, getting it right requires benchmarking different workloads concurrently.
But your case is interesting: you are running on a single core (how many disks per node?). So setting it toward the higher end of the spectrum, as Joey suggested, makes sense.

-Bharath

________________________________
From: Joey Echeverria <[email protected]>
To: [email protected]
Sent: Friday, July 8, 2011 9:14 AM
Subject: Re: Cluster Tuning

Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. 1.0 means
the maps have to completely finish before the reduce starts copying any data.
I often run jobs with this set to .90-.95.

-Joey

On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
> Here's another thought. I realized that the reduce operation in my
> map/reduce jobs is a flash, but it goes really slowly until the mappers
> end. Is there a way to configure the cluster to make the reduce wait for
> the map operations to complete? Especially considering my hardware
> constraints.
>
> Thanks!
> Pony
>
> On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
>
>> Hey guys,
>> Thanks all of you for your help.
>>
>> Joey,
>> I tweaked my MapReduce to serialize/deserialize only essential values
>> and added a combiner, and that helped a lot. Previously I had a domain
>> object being passed between Mapper and Reducer when I only needed a
>> single value.
>>
>> Esteban,
>> I think you underestimate the constraints of my cluster. Running
>> multiple tasks per JVM really kills me in terms of memory. Not to
>> mention that with a single core there's not much to gain in terms of
>> parallelism (other than perhaps while a process is waiting on an I/O
>> operation). Still, I gave it a shot, but even though I kept changing
>> the config I always ended up with a Java heap space error.
>>
>> Is it me, or is performance tuning mostly a per-job task? I mean it
>> will, in the end, depend on the data you are processing (structure,
>> size, whether it's in one file or many, etc.).
>> If my jobs have different sets of data, which are in different formats
>> and organized in different file structures, do you guys recommend
>> moving some of the configuration to Java code?
>>
>> Thanks!
>> Pony
>>
>> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>>
>>> Are you the Esteban I know?
>>>
>>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>>> wrote:
>>>
>>> > Hi Pony,
>>> >
>>> > There is a good chance that your boxes are doing some heavy swapping,
>>> > and that is a killer for Hadoop. Have you tried
>>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
>>> > boxes as much as possible?
>>> >
>>> > Cheers,
>>> > Esteban.
>>> >
>>> > --
>>> > Get Hadoop! http://www.cloudera.com/downloads/
>>> >
>>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>>> >
>>> >> Hi guys!
>>> >>
>>> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes
>>> >> exactly alike: single-core machines with 600MB of RAM. No chance of
>>> >> upgrading the hardware.
>>> >>
>>> >> My cluster is made up of 1 NameNode/JobTracker box and 19
>>> >> DataNode/TaskTracker boxes.
>>> >>
>>> >> All my config is default, except I've set the following in my
>>> >> mapred-site.xml in an effort to avoid choking my boxes:
>>> >>
>>> >> <property>
>>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>> >>   <value>1</value>
>>> >> </property>
>>> >>
>>> >> I'm running a MapReduce job which reads a proxy server log file
>>> >> (2GB), maps hosts to each record, and then in the reduce task
>>> >> accumulates the amount of bytes received from each host.
>>> >>
>>> >> Currently it's producing about 65,000 keys.
>>> >>
>>> >> The whole job takes forever to complete, especially the reduce part.
>>> >> I've tried different tuning configs but I can't bring it down under
>>> >> 20 mins.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> Thanks for your help!
>>> >> Pony

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
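[Editor's note: for reference, the two tunings discussed in the thread (Joey's slowstart value and Esteban's JVM-reuse suggestion) would go in mapred-site.xml roughly as below. This is a sketch using the classic MR1 property names from the thread; the values are starting points to benchmark, not fixed recommendations.]

```xml
<!-- mapred-site.xml (MR1 property names, as discussed in the thread) -->
<property>
  <!-- Fraction of map tasks that must finish before reducers start
       copying data. Joey suggests 0.90-0.95 for a constrained cluster. -->
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.95</value>
</property>
<property>
  <!-- Esteban's suggestion: reuse each task JVM indefinitely (-1).
       Pony hit Java heap space errors with this on 600MB boxes, so
       treat it as an experiment to measure, not a default. -->
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```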
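[Editor's note: the aggregation Pony's job performs, and why the combiner helped, can be sketched in plain Java without Hadoop. Each record maps to (host, bytes); a combiner pre-sums those pairs inside each map task, exactly like the merge below, so far less data crosses the network to the single reduce. The log format and field positions here are assumptions for illustration, not Pony's actual format.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local sketch of "bytes received per host" aggregation.
public class HostBytes {

    // "map" step: extract (host, bytes) from one log line.
    // Assumed line format: "<host> <timestamp> <bytes>".
    static Map.Entry<String, Long> map(String line) {
        String[] fields = line.trim().split("\\s+");
        return Map.entry(fields[0], Long.parseLong(fields[2]));
    }

    // "combine"/"reduce" step: sum byte counts per host.
    // A Hadoop combiner runs this same summing per map task.
    static Map<String, Long> aggregate(List<String> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : lines) {
            Map.Entry<String, Long> kv = map(line);
            totals.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "example.com 1310127600 512",
            "example.org 1310127601 1024",
            "example.com 1310127602 256");
        System.out.println(aggregate(log));
    }
}
```

Because summing is associative and commutative, running it early in a combiner changes nothing about the final totals, which is what made it safe for Pony's job.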
