Hi guys! Here's my mapred-site.xml. I've tweaked a few properties, but it's still taking about 8-10 mins to process 4GB of data. Thought maybe you guys could spot something worth commenting on. Thanks!
Pony
*<?xml version="1.0"?>* *<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>* * * *<configuration>* * <property>* * <name>mapred.job.tracker</name>* * <value>name-node:54311</value>* * </property>* * <property>* * <name>mapred.tasktracker.map.tasks.maximum</name>* * <value>1</value>* * </property>* * <property>* * <name>mapred.tasktracker.reduce.tasks.maximum</name>* * <value>1</value>* * </property>* * <property>* * <name>mapred.compress.map.output</name>* * <value>true</value>* * </property>* * <property>* * <name>mapred.map.output.compression.codec</name>* * <value>org.apache.hadoop.io.compress.GzipCodec</value>* * </property>* * <property>* * <name>mapred.child.java.opts</name>* * <value>-Xmx400m</value>* * </property>* * <property>* * <name>map.sort.class</name>* * <value>org.apache.hadoop.util.HeapSort</value>* * </property>* * <property>* * <name>mapred.reduce.slowstart.completed.maps</name>* * <value>0.85</value>* * </property>* * <property>* * <name>mapred.map.tasks.speculative.execution</name>* * <value>false</value>* * </property>* * <property>* * <name>mapred.reduce.tasks.speculative.execution</name>* * <value>false</value>* * </property>* *</configuration>* On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi <[email protected]>wrote: > Slow start is an important parameter. Definitely impacts job runtime. My > experience in the past has been that, setting this parameter to too low or > setting to too high can have issues with job latencies. If you are trying to > run same job then its easy to set right value but if your cluster is > multi-tenancy then getting this to right requires some benchmarking of > different workloads concurrently. > > But you case is interesting, you are running on a single core(How many > disks per node?). So setting to higher side of the spectrum as suggested by > Joey makes sense. > > > -Bharath > > > > > > ________________________________ > From: Joey Echeverria <[email protected]> > To: [email protected] > Sent: Friday, July 8, 2011 9:14 AM > Subject: Re: Cluster Tuning > > Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. > 1.0 means the maps have to completely finish before the reduce starts > copying any data. I often run jobs with this set to .90-.95. > > -Joey > > On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote: > > Here's another thought. I realized that the reduce operation in my > > map/reduce jobs is a flash. But it goes reaaaaaaaaally slow until the > > mappers end. Is there a way to configure the cluster to make the reduce > wait > > for the map operations to complete? Specially considering my hardware > > restraints > > > > Thanks! > > Pony > > > > On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote: > > > >> Hey guys, > >> Thanks all of you for your help. > >> > >> Joey, > >> I tweaked my MapReduce to serialize/deserialize only escencial values > and > >> added a combiner and that helped a lot. Previously I had a domain object > >> which was being passed between Mapper and Reducer when I only needed a > >> single value. > >> > >> Esteban, > >> I think you underestimate the constraints of my cluster. Adding multiple > >> jobs per JVM really kills me in terms of memory. Not to mention that by > >> having a single core there's not much to gain in terms of paralelism > (other > >> than perhaps while a process is waiting of an I/O operation). Still I > gave > >> it a shot, but even though I kept changing the config I always ended > with a > >> Java heap space error. 
On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi <[email protected]> wrote:

> Slow start is an important parameter; it definitely impacts job runtime.
> My experience in the past has been that setting this parameter too low or
> too high can cause issues with job latencies. If you are always running
> the same job then it's easy to set the right value, but if your cluster
> is multi-tenant then getting it right requires benchmarking different
> workloads concurrently.
>
> But your case is interesting: you are running on a single core (how many
> disks per node?), so setting it to the higher side of the spectrum, as
> Joey suggested, makes sense.
>
> -Bharath
>
> ________________________________
> From: Joey Echeverria <[email protected]>
> To: [email protected]
> Sent: Friday, July 8, 2011 9:14 AM
> Subject: Re: Cluster Tuning
>
> Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
> 1.0 means the maps have to completely finish before the reduce starts
> copying any data. I often run jobs with this set to .90-.95.
>
> -Joey
>
> On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
> > Here's another thought. I realized that the reduce operation in my
> > map/reduce jobs is a flash, but it goes reaaaaaaaaally slow until the
> > mappers end. Is there a way to configure the cluster to make the reduce
> > wait for the map operations to complete? Especially considering my
> > hardware constraints.
> >
> > Thanks!
> > Pony
> >
> > On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
> >
> >> Hey guys,
> >> Thanks all of you for your help.
> >>
> >> Joey,
> >> I tweaked my MapReduce job to serialize/deserialize only essential
> >> values and added a combiner, and that helped a lot. Previously I had
> >> a domain object which was being passed between Mapper and Reducer when
> >> I only needed a single value.
> >>
> >> Esteban,
> >> I think you underestimate the constraints of my cluster. Running
> >> multiple tasks per JVM really kills me in terms of memory. Not to
> >> mention that with a single core there's not much to gain in terms of
> >> parallelism (other than perhaps while a process is waiting on an I/O
> >> operation). Still, I gave it a shot, but even though I kept changing
> >> the config I always ended up with a Java heap space error.
> >>
> >> Is it me, or is performance tuning mostly a per-job task? I mean it
> >> will, in the end, depend on the data you are processing (structure,
> >> size, whether it's in one file or many, etc.). If my jobs have
> >> different sets of data, which are in different formats and organized
> >> in different file structures, do you guys recommend moving some of
> >> the configuration to Java code?
> >>
> >> Thanks!
> >> Pony
> >>
> >> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
> >>
> >>> Are you the Esteban I know?
> >>>
> >>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
> >>> wrote:
> >>>
> >>> > Hi Pony,
> >>> >
> >>> > There is a good chance that your boxes are doing some heavy
> >>> > swapping, and that is a killer for Hadoop. Have you tried
> >>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
> >>> > boxes as much as possible?
> >>> >
> >>> > Cheers,
> >>> > Esteban.
> >>> >
> >>> > --
> >>> > Get Hadoop! http://www.cloudera.com/downloads/
> >>> >
> >>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
> >>> >
> >>> >> Hi guys!
> >>> >>
> >>> >> I'd like some help fine-tuning my cluster. I currently have 20
> >>> >> boxes exactly alike: single-core machines with 600MB of RAM. No
> >>> >> chance of upgrading the hardware.
> >>> >>
> >>> >> My cluster is made up of 1 NameNode/JobTracker box and 19
> >>> >> DataNode/TaskTracker boxes.
> >>> >>
> >>> >> All my config is default except I've set the following in my
> >>> >> mapred-site.xml in an effort to try and prevent choking my boxes:
> >>> >>
> >>> >> <property>
> >>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
> >>> >>   <value>1</value>
> >>> >> </property>
> >>> >>
> >>> >> I'm running a MapReduce job which reads a proxy server log file
> >>> >> (2GB), maps a host to each record, and then in the reduce task
> >>> >> accumulates the number of bytes received from each host.
> >>> >>
> >>> >> Currently it's producing about 65000 keys.
> >>> >>
> >>> >> The whole job takes forever to complete, especially the reduce
> >>> >> part. I've tried different tuning configs but I can't bring it
> >>> >> down under 20 mins.
> >>> >>
> >>> >> Any ideas?
> >>> >>
> >>> >> Thanks for your help!
> >>> >> Pony
> >>>
> >>
> >
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
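For anyone landing on this thread later: the combiner Pony mentions above maps naturally onto the bytes-per-host job. Since summing is associative and commutative, the same class can serve as both combiner and reducer. A minimal sketch against the old mapred API of that era; the class name and types are illustrative assumptions, not Pony's actual code:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Sums the byte counts emitted for each host. Registered as a combiner
  // it collapses map output before it is spilled and shipped across the
  // network, which is what made the shuffle so much cheaper for Pony.
  public class BytesPerHostReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text host, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      long totalBytes = 0;
      while (values.hasNext()) {
        totalBytes += values.next().get();
      }
      output.collect(host, new LongWritable(totalBytes));
    }
  }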
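And on Pony's question upthread about moving configuration into Java code: job-specific settings can be applied on the JobConf at submission time, leaving mapred-site.xml for cluster-wide defaults. A sketch under the same assumptions, where HostBytesMapper is a hypothetical mapper that parses a log record and emits (host, bytes) pairs:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class ProxyLogJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ProxyLogJob.class);
      conf.setJobName("bytes-per-host");

      // Job-specific tuning lives here instead of in mapred-site.xml.
      conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);
      conf.setBoolean("mapred.compress.map.output", true);

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(LongWritable.class);
      conf.setMapperClass(HostBytesMapper.class);       // hypothetical mapper
      conf.setCombinerClass(BytesPerHostReducer.class); // from the sketch above
      conf.setReducerClass(BytesPerHostReducer.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }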
