BTW: Here's the Job Output https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US
On Mon, Jul 11, 2011 at 1:28 PM, Juan P. <[email protected]> wrote:
> Hi guys! Here's my mapred-site.xml.
> I've tweaked a few properties, but it's still taking about 8-10 minutes
> to process 4GB of data. Thought maybe you guys could find something
> you'd comment on.
> Thanks!
> Pony
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>name-node:54311</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.GzipCodec</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx400m</value>
>   </property>
>   <property>
>     <name>map.sort.class</name>
>     <value>org.apache.hadoop.util.HeapSort</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.85</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
> </configuration>
>
> On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi
> <[email protected]> wrote:
>
>> Slow start is an important parameter. It definitely impacts job
>> runtime. My experience in the past has been that setting this
>> parameter too low or too high can cause issues with job latencies. If
>> you always run the same job, it's easy to pick the right value, but if
>> your cluster is multi-tenant, getting it right requires benchmarking
>> different workloads concurrently.
>>
>> But your case is interesting: you are running on a single core (how
>> many disks per node?). So setting it toward the higher end of the
>> spectrum, as suggested by Joey, makes sense.
>>
>> -Bharath
>>
>> ________________________________
>> From: Joey Echeverria <[email protected]>
>> To: [email protected]
>> Sent: Friday, July 8, 2011 9:14 AM
>> Subject: Re: Cluster Tuning
>>
>> Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
>> 1.0 means the maps have to completely finish before the reduce starts
>> copying any data. I often run jobs with this set to .90-.95.
>>
>> -Joey
>>
>> On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
>> > Here's another thought. I realized that the reduce operation in my
>> > map/reduce jobs is a flash, but it goes really slowly until the
>> > mappers end. Is there a way to configure the cluster to make the
>> > reduce wait for the map operations to complete? Especially
>> > considering my hardware constraints.
>> >
>> > Thanks!
>> > Pony
>> >
>> > On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
>> >
>> >> Hey guys,
>> >> Thanks all of you for your help.
>> >>
>> >> Joey,
>> >> I tweaked my MapReduce job to serialize/deserialize only essential
>> >> values and added a combiner, and that helped a lot. Previously I
>> >> had a domain object being passed between Mapper and Reducer when I
>> >> only needed a single value.
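A note on why the combiner above pays off: the per-host byte total is a
plain sum, which is associative and commutative, so partial totals can be
folded together map-side before anything hits the network. Below is a
minimal sketch of that pattern in the mapreduce API; the class names, the
whitespace-delimited log layout, and the field positions are illustrative
assumptions, not details from this thread.

    // Sketch: per-host byte totals with the reducer reused as a combiner.
    // The log layout (host in field 0, byte count in field 1) and all
    // class names are illustrative assumptions.
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HostBytes {

      public static class HostBytesMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text host = new Text();
        private final LongWritable bytes = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Emit only the two values the job needs (host, bytes) rather
          // than a whole domain object: the serialization fix described
          // above.
          String[] fields = line.toString().split("\\s+");
          host.set(fields[0]);                   // assumed field position
          bytes.set(Long.parseLong(fields[1]));  // assumed field position
          context.write(host, bytes);
        }
      }

      // A sum is associative and commutative, so the same class can act
      // as both combiner and reducer; run as a combiner it folds partial
      // totals together map-side, before the shuffle.
      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text host, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          context.write(host, new LongWritable(sum));
        }
      }
    }

Reusing the reducer as the combiner only works because a sum is
order-insensitive; something like an average would need a separate
combiner that emits partial (sum, count) pairs instead.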
>> >>
>> >> Esteban,
>> >> I think you underestimate the constraints of my cluster. Running
>> >> multiple tasks per JVM really kills me in terms of memory. Not to
>> >> mention that with a single core there's not much to gain in terms
>> >> of parallelism (other than perhaps while a process is waiting on an
>> >> I/O operation). Still, I gave it a shot, but even though I kept
>> >> changing the config I always ended up with a Java heap space error.
>> >>
>> >> Is it me, or is performance tuning mostly a per-job task? I mean it
>> >> will, in the end, depend on the data you are processing (structure,
>> >> size, whether it's in one file or many, etc.). If my jobs have
>> >> different sets of data, which are in different formats and
>> >> organized in different file structures, do you guys recommend
>> >> moving some of the configuration to Java code?
>> >>
>> >> Thanks!
>> >> Pony
>> >>
>> >> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>> >>
>> >>> Are you the Esteban I know?
>> >>>
>> >>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>> >>> wrote:
>> >>>
>> >>> > Hi Pony,
>> >>> >
>> >>> > There is a good chance that your boxes are doing some heavy
>> >>> > swapping, and that is a killer for Hadoop. Have you tried
>> >>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
>> >>> > boxes as much as possible?
>> >>> >
>> >>> > Cheers,
>> >>> > Esteban.
>> >>> >
>> >>> > --
>> >>> > Get Hadoop! http://www.cloudera.com/downloads/
>> >>> >
>> >>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>> >>> >
>> >>> >> Hi guys!
>> >>> >>
>> >>> >> I'd like some help fine-tuning my cluster. I currently have 20
>> >>> >> boxes exactly alike: single-core machines with 600MB of RAM. No
>> >>> >> chance of upgrading the hardware.
>> >>> >>
>> >>> >> My cluster is made up of 1 NameNode/JobTracker box and 19
>> >>> >> DataNode/TaskTracker boxes.
>> >>> >>
>> >>> >> All my config is default except I've set the following in my
>> >>> >> mapred-site.xml in an effort to avoid choking my boxes:
>> >>> >>
>> >>> >> <property>
>> >>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>> >>   <value>1</value>
>> >>> >> </property>
>> >>> >>
>> >>> >> I'm running a MapReduce job which reads a proxy server log file
>> >>> >> (2GB), maps hosts to each record, and then in the reduce task
>> >>> >> accumulates the amount of bytes received from each host.
>> >>> >>
>> >>> >> Currently it's producing about 65,000 keys.
>> >>> >>
>> >>> >> The whole job takes forever to complete, especially the reduce
>> >>> >> part. I've tried different tuning configs but I can't bring it
>> >>> >> down under 20 minutes.
>> >>> >>
>> >>> >> Any ideas?
>> >>> >>
>> >>> >> Thanks for your help!
>> >>> >> Pony
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
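On Pony's question above about moving configuration into Java code:
job-scoped properties such as the slowstart fraction and map-output
compression can be set per job in the driver, leaving mapred-site.xml for
cluster-wide defaults. A minimal driver sketch follows, assuming the
HostBytes classes from the earlier sketch; the paths, job name, and
chosen values are illustrative:

    // Sketch: setting job-scoped tuning properties in the driver instead
    // of cluster-wide in mapred-site.xml. Property names are the ones
    // discussed in this thread; paths and class names are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HostBytesDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job-scoped settings: these override mapred-site.xml for this
        // job only.
        conf.set("mapred.reduce.slowstart.completed.maps", "0.90");
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
            "org.apache.hadoop.io.compress.GzipCodec");

        Job job = new Job(conf, "host-bytes");
        job.setJarByClass(HostBytesDriver.class);
        job.setMapperClass(HostBytes.HostBytesMapper.class);
        job.setCombinerClass(HostBytes.SumReducer.class); // map-side sums
        job.setReducerClass(HostBytes.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Per-job settings like these suit the situation described above, where
each job has differently shaped input. Daemon-level knobs such as
mapred.tasktracker.map.tasks.maximum still belong in mapred-site.xml,
since they configure the TaskTracker itself rather than any one job.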
