BTW: Here's the Job Output

https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US

On Mon, Jul 11, 2011 at 1:28 PM, Juan P. <[email protected]> wrote:

> Hi guys! Here's my mapred-site.xml
> I've tweaked a few properties but it's still taking about 8-10 minutes to
> process 4GB of data. Thought maybe you guys could find something you'd
> comment on.
> Thanks!
> Pony
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>name-node:54311</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.GzipCodec</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx400m</value>
>   </property>
>   <property>
>     <name>map.sort.class</name>
>     <value>org.apache.hadoop.util.HeapSort</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.85</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
> </configuration>
>
> On Fri, Jul 8, 2011 at 4:21 PM, Bharath Mundlapudi
> <[email protected]> wrote:
>
>> Slow start is an important parameter and it definitely impacts job
>> runtime. My experience in the past has been that setting this parameter
>> too low or too high can cause problems with job latencies. If you always
>> run the same job it's easy to pick the right value, but if your cluster
>> is multi-tenant, getting it right requires benchmarking different
>> workloads running concurrently.
>>
>> But your case is interesting: you are running on a single core (how many
>> disks per node?), so setting it toward the higher end of the spectrum,
>> as Joey suggested, makes sense.
>>
>>
>> -Bharath
>>
>>
>>
>>
>>
>> ________________________________
>> From: Joey Echeverria <[email protected]>
>> To: [email protected]
>> Sent: Friday, July 8, 2011 9:14 AM
>> Subject: Re: Cluster Tuning
>>
>> Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
>> 1.0 means the maps have to completely finish before the reduce starts
>> copying any data. I often run jobs with this set to .90-.95.
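>>
>> For example, something like this in mapred-site.xml (0.95 is just an
>> illustration, pick whatever works for your jobs):
>>
>>   <property>
>>     <name>mapred.reduce.slowstart.completed.maps</name>
>>     <value>0.95</value>
>>   </property>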
>>
>> -Joey
>>
>> On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
>> > Here's another thought. I realized that the reduce operation in my
>> > map/reduce jobs is a flash. But it goes reaaaaaaaaally slow until the
>> > mappers end. Is there a way to configure the cluster to make the reduce
>> > wait for the map operations to complete? Especially considering my
>> > hardware constraints.
>> >
>> > Thanks!
>> > Pony
>> >
>> > On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
>> >
>> >> Hey guys,
>> >> Thanks all of you for your help.
>> >>
>> >> Joey,
>> >> I tweaked my MapReduce to serialize/deserialize only essential values
>> >> and added a combiner, and that helped a lot. Previously I had a domain
>> >> object which was being passed between Mapper and Reducer when I only
>> >> needed a single value.
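>> >>
>> >> Roughly, the combiner I added looks like this (class and field names
>> >> are simplified here; it's basically the same summing the reducer does,
>> >> just run on the map side first):
>> >>
>> >> import java.io.IOException;
>> >> import org.apache.hadoop.io.LongWritable;
>> >> import org.apache.hadoop.io.Text;
>> >> import org.apache.hadoop.mapreduce.Reducer;
>> >>
>> >> public class BytesSumCombiner
>> >>     extends Reducer<Text, LongWritable, Text, LongWritable> {
>> >>   private final LongWritable sum = new LongWritable();
>> >>
>> >>   @Override
>> >>   protected void reduce(Text host, Iterable<LongWritable> values,
>> >>       Context context) throws IOException, InterruptedException {
>> >>     long total = 0;
>> >>     for (LongWritable value : values) {
>> >>       total += value.get();  // accumulate bytes received for this host
>> >>     }
>> >>     sum.set(total);
>> >>     context.write(host, sum);
>> >>   }
>> >> }
>> >>
>> >> It gets wired in with job.setCombinerClass(BytesSumCombiner.class).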
>> >>
>> >> Esteban,
>> >> I think you underestimate the constraints of my cluster. Adding
>> >> multiple jobs per JVM really kills me in terms of memory. Not to
>> >> mention that with a single core there's not much to gain in terms of
>> >> parallelism (other than perhaps while a process is waiting on an I/O
>> >> operation). Still, I gave it a shot, but even though I kept changing
>> >> the config I always ended up with a Java heap space error.
>> >>
>> >> Is it me, or is performance tuning mostly a per-job task? I mean it
>> >> will, in the end, depend on the data you are processing (structure,
>> >> size, whether it's in one file or many, etc.). If my jobs have
>> >> different sets of data, which are in different formats and organized
>> >> in different file structures, do you guys recommend moving some of
>> >> the configuration to Java code?
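>> >>
>> >> Something like this is what I had in mind, per job, instead of putting
>> >> it all in mapred-site.xml (property names are the ones from my config;
>> >> the values and the JobSetup/BytesSumCombiner names are just from my
>> >> sketch above):
>> >>
>> >> import org.apache.hadoop.conf.Configuration;
>> >> import org.apache.hadoop.mapreduce.Job;
>> >>
>> >> public class JobSetup {
>> >>   public static Job buildJob() throws Exception {
>> >>     Configuration conf = new Configuration();
>> >>     // same knobs as in mapred-site.xml, but scoped to this one job
>> >>     conf.setBoolean("mapred.compress.map.output", true);
>> >>     conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);
>> >>     Job job = new Job(conf, "proxy-log-bytes-per-host");
>> >>     job.setCombinerClass(BytesSumCombiner.class);
>> >>     return job;
>> >>   }
>> >> }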
>> >>
>> >> Thanks!
>> >> Pony
>> >>
>> >> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>> >>
>> >>> Are you the Esteban I know?
>> >>>
>> >>>
>> >>>
>> >>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>> >>> wrote:
>> >>>
>> >>> > Hi Pony,
>> >>> >
>> >>> > There is a good chance that your boxes are doing some heavy
>> >>> > swapping, and that is a killer for Hadoop. Have you tried
>> >>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
>> >>> > boxes as much as possible?
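>> >>> >
>> >>> > For example, in mapred-site.xml (the -Xmx value here is only an
>> >>> > illustration; size it to whatever leaves headroom for the OS on a
>> >>> > 600MB box):
>> >>> >
>> >>> >   <property>
>> >>> >     <name>mapred.job.reuse.jvm.num.tasks</name>
>> >>> >     <value>-1</value>
>> >>> >   </property>
>> >>> >   <property>
>> >>> >     <name>mapred.child.java.opts</name>
>> >>> >     <value>-Xmx200m</value>
>> >>> >   </property>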
>> >>> >
>> >>> > Cheers,
>> >>> > Esteban.
>> >>> >
>> >>> > --
>> >>> > Get Hadoop!  http://www.cloudera.com/downloads/
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>> >>> >
>> >>> >> Hi guys!
>> >>> >>
>> >>> >> I'd like some help fine-tuning my cluster. I currently have 20
>> >>> >> boxes exactly alike: single-core machines with 600MB of RAM. No
>> >>> >> chance of upgrading the hardware.
>> >>> >>
>> >>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
>> >>> >> DataNode/TaskTracker boxes.
>> >>> >>
>> >>> >> All my config is default except I've set the following in my
>> >>> >> mapred-site.xml in an effort to try and prevent choking my boxes:
>> >>> >>   <property>
>> >>> >>     <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>> >>     <value>1</value>
>> >>> >>   </property>
>> >>> >>
>> >>> >> I'm running a MapReduce job which reads a Proxy Server log file
>> >>> >> (2GB), maps a host to each record, and then in the reduce task
>> >>> >> accumulates the amount of bytes received from each host.
>> >>> >>
>> >>> >> Currently it's producing about 65000 keys.
>> >>> >>
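>> >>> >> Roughly, the map side looks like this (class name and the field
>> >>> >> positions in the log line are simplified; the real parsing is a
>> >>> >> bit more involved):
>> >>> >>
>> >>> >> import java.io.IOException;
>> >>> >> import org.apache.hadoop.io.LongWritable;
>> >>> >> import org.apache.hadoop.io.Text;
>> >>> >> import org.apache.hadoop.mapreduce.Mapper;
>> >>> >>
>> >>> >> public class HostBytesMapper
>> >>> >>     extends Mapper<LongWritable, Text, Text, LongWritable> {
>> >>> >>   private final Text host = new Text();
>> >>> >>   private final LongWritable bytes = new LongWritable();
>> >>> >>
>> >>> >>   @Override
>> >>> >>   protected void map(LongWritable offset, Text line, Context context)
>> >>> >>       throws IOException, InterruptedException {
>> >>> >>     String[] fields = line.toString().split("\\s+");
>> >>> >>     if (fields.length < 2) {
>> >>> >>       return;  // skip malformed lines
>> >>> >>     }
>> >>> >>     try {
>> >>> >>       bytes.set(Long.parseLong(fields[1]));  // bytes field (position assumed)
>> >>> >>     } catch (NumberFormatException e) {
>> >>> >>       return;  // skip lines where the bytes field isn't numeric
>> >>> >>     }
>> >>> >>     host.set(fields[0]);  // host field (position assumed)
>> >>> >>     context.write(host, bytes);
>> >>> >>   }
>> >>> >> }
>> >>> >>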
>> >>> >> The whole job takes forever to complete, especially the reduce
>> >>> >> part. I've tried different tuning configs but I can't bring it
>> >>> >> down under 20 minutes.
>> >>> >>
>> >>> >> Any ideas?
>> >>> >>
>> >>> >> Thanks for your help!
>> >>> >> Pony
>> >>> >>
>> >>>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>
>
