Hi guys! Here's my mapred-site.xml.
I've tweaked a few properties, but it's still taking about 8-10 minutes to
process 4GB of data. Thought maybe you guys could spot something worth
commenting on.
Thanks!
Pony

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>map.sort.class</name>
    <value>org.apache.hadoop.util.HeapSort</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.85</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
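For reference, the job being tuned (described further down-thread) maps each proxy-log record to a host and then sums the bytes received per host. Here is a minimal stand-alone sketch of that logic in plain Python; the log layout assumed here (host in the first field, byte count in the second) is an illustration, not the actual format:

```python
# Sketch of the per-host byte aggregation the MapReduce job performs.
# The log format is an assumption for illustration only.
from collections import defaultdict

def map_record(line):
    """Map phase: emit a (host, bytes) pair for one log record."""
    fields = line.split()
    return fields[0], int(fields[1])

def reduce_bytes(pairs):
    """Combine/reduce phase: accumulate total bytes per host."""
    totals = defaultdict(int)
    for host, nbytes in pairs:
        totals[host] += nbytes
    return dict(totals)

log = [
    "host-a.example.com 512",
    "host-b.example.com 2048",
    "host-a.example.com 1024",
]
totals = reduce_bytes(map_record(line) for line in log)
print(totals)  # {'host-a.example.com': 1536, 'host-b.example.com': 2048}
```

The combiner mentioned down-thread does this same per-host summing on the map side, which is why it cuts down the data shuffled to the reducers.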

On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi <[email protected]> wrote:

> Slow start is an important parameter and definitely impacts job runtime.
> My experience has been that setting it too low or too high can hurt job
> latency. If you always run the same job it's easy to pick the right
> value, but on a multi-tenant cluster getting it right requires
> benchmarking different workloads concurrently.
>
> But your case is interesting: you are running on a single core (how many
> disks per node?), so setting it toward the higher end of the spectrum,
> as Joey suggested, makes sense.
>
>
> -Bharath
>
>
>
>
>
> ________________________________
> From: Joey Echeverria <[email protected]>
> To: [email protected]
> Sent: Friday, July 8, 2011 9:14 AM
> Subject: Re: Cluster Tuning
>
> Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
> 1.0 means the maps have to completely finish before the reduce starts
> copying any data. I often run jobs with this set to .90-.95.
>
> -Joey
>
> On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
> > Here's another thought. I realized that the reduce operation in my
> > map/reduce jobs finishes in a flash, but it goes really slowly until
> > the mappers end. Is there a way to configure the cluster to make the
> > reduce wait for the map operations to complete? Especially considering
> > my hardware constraints.
> >
> > Thanks!
> > Pony
> >
> > On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
> >
> >> Hey guys,
> >> Thanks all of you for your help.
> >>
> >> Joey,
> >> I tweaked my MapReduce job to serialize/deserialize only essential
> >> values and added a combiner, and that helped a lot. Previously I had a
> >> domain object being passed between Mapper and Reducer when I only
> >> needed a single value.
> >>
> >> Esteban,
> >> I think you underestimate the constraints of my cluster. Running
> >> multiple tasks per JVM really kills me in terms of memory. Not to
> >> mention that with a single core there's not much to gain in terms of
> >> parallelism (other than perhaps while a process is waiting on an I/O
> >> operation). Still, I gave it a shot, but no matter how I changed the
> >> config I always ended up with a Java heap space error.
> >>
> >> Is it just me, or is performance tuning mostly a per-job task? In the
> >> end it depends on the data you are processing (structure, size, whether
> >> it's in one file or many, etc.). If my jobs have different sets of
> >> data, in different formats and organized in different file structures,
> >> do you guys recommend moving some of the configuration to Java code?
> >>
> >> Thanks!
> >> Pony
> >>
> >> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
> >>
> >>> Are you the Esteban I know?
> >>>
> >>>
> >>>
> >>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
> >>> wrote:
> >>>
> >>> > Hi Pony,
> >>> >
> >>> > There is a good chance that your boxes are doing some heavy
> >>> > swapping, and that is a killer for Hadoop. Have you tried
> >>> > mapred.job.reuse.jvm.num.tasks=-1 while limiting the heap on those
> >>> > boxes as much as possible?
> >>> >
> >>> > Cheers,
> >>> > Esteban.
> >>> >
> >>> > --
> >>> > Get Hadoop!  http://www.cloudera.com/downloads/
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]>
> wrote:
> >>> >
> >>> >> Hi guys!
> >>> >>
> >>> >> I'd like some help fine-tuning my cluster. I currently have 20
> >>> >> identical boxes: single-core machines with 600MB of RAM. No chance
> >>> >> of upgrading the hardware.
> >>> >>
> >>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
> >>> >> DataNode/TaskTracker boxes.
> >>> >>
> >>> >> All my config is default except I've set the following in my
> >>> >> mapred-site.xml in an effort to avoid choking my boxes:
> >>> >> <property>
> >>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
> >>> >>   <value>1</value>
> >>> >> </property>
> >>> >>
> >>> >> I'm running a MapReduce job which reads a proxy server log file
> >>> >> (2GB), maps a host to each record, and then in the reduce task
> >>> >> accumulates the number of bytes received from each host.
> >>> >>
> >>> >> Currently it's producing about 65,000 keys.
> >>> >>
> >>> >> The whole job takes forever to complete, especially the reduce
> >>> >> part. I've tried different tuning configs but I can't bring it
> >>> >> down under 20 minutes.
> >>> >>
> >>> >> Any ideas?
> >>> >>
> >>> >> Thanks for your help!
> >>> >> Pony
> >>> >>
> >>>
> >>
> >>
> >
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>
