Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. A value of 1.0 means the maps have to completely finish before the reducers start copying any data. I often run jobs with this set to 0.90-0.95.
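
For example, in mapred-site.xml (the same place as your map.tasks.maximum setting), something roughly like this should work; the exact value is up to you:

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.95</value>
  <!-- reducers don't start the copy phase until 95% of maps are done -->
</property>

You can also pass it per job on the command line with -Dmapred.reduce.slowstart.completed.maps=0.95 if your job uses ToolRunner/GenericOptionsParser.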
-Joey

On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <[email protected]> wrote:
> Here's another thought. I realized that the reduce operation in my
> map/reduce jobs is a flash, but it goes really slowly until the mappers
> end. Is there a way to configure the cluster to make the reduce wait
> for the map operations to complete? Especially considering my hardware
> constraints.
>
> Thanks!
> Pony
>
> On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:
>
>> Hey guys,
>> Thanks all of you for your help.
>>
>> Joey,
>> I tweaked my MapReduce job to serialize/deserialize only essential
>> values and added a combiner, and that helped a lot. Previously I had a
>> domain object which was being passed between Mapper and Reducer when I
>> only needed a single value.
>>
>> Esteban,
>> I think you underestimate the constraints of my cluster. Running
>> multiple tasks per JVM really kills me in terms of memory. Not to
>> mention that with a single core there's not much to gain in terms of
>> parallelism (other than perhaps while a process is waiting for an I/O
>> operation). Still, I gave it a shot, but even though I kept changing
>> the config I always ended up with a Java heap space error.
>>
>> Is it me, or is performance tuning mostly a per-job task? I mean it
>> will, in the end, depend on the data you are processing (structure,
>> size, whether it's in one file or many, etc.). If my jobs have
>> different sets of data, which are in different formats and organized
>> in different file structures, do you guys recommend moving some of the
>> configuration to Java code?
>>
>> Thanks!
>> Pony
>>
>> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>>
>>> Are you the Esteban I know?
>>>
>>>
>>>
>>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>>> wrote:
>>>
>>> > Hi Pony,
>>> >
>>> > There is a good chance that your boxes are doing some heavy swapping,
>>> > and that is a killer for Hadoop. Have you tried
>>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
>>> > boxes as much as possible?
>>> >
>>> > Cheers,
>>> > Esteban.
>>> >
>>> > --
>>> > Get Hadoop! http://www.cloudera.com/downloads/
>>> >
>>> >
>>> >
>>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>>> >
>>> >> Hi guys!
>>> >>
>>> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes
>>> >> exactly alike: single-core machines with 600MB of RAM. No chance of
>>> >> upgrading the hardware.
>>> >>
>>> >> My cluster is made up of 1 NameNode/JobTracker box and 19
>>> >> DataNode/TaskTracker boxes.
>>> >>
>>> >> All my config is default except I've set the following in my
>>> >> mapred-site.xml in an effort to try to prevent choking my boxes:
>>> >>
>>> >> <property>
>>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>> >>   <value>1</value>
>>> >> </property>
>>> >>
>>> >> I'm running a MapReduce job which reads a proxy server log file
>>> >> (2GB), maps hosts to each record, and then in the reduce task
>>> >> accumulates the amount of bytes received from each host.
>>> >>
>>> >> Currently it's producing about 65000 keys.
>>> >>
>>> >> The whole job takes forever to complete, especially the reduce part.
>>> >> I've tried different tuning configs but I can't bring it down under
>>> >> 20 minutes.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> Thanks for your help!
>>> >> Pony
>>> >>

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
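
[For reference, a minimal sketch of the bytes-per-host job discussed in this thread, written against the 0.20-era org.apache.hadoop.mapreduce API. The class names, log layout (host in the first column, byte count in the second), and the 0.95 slow-start value are assumptions for illustration, not Pony's actual code.]

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HostBytes {

  // Emits (host, bytes) per log line instead of a whole domain object,
  // so only the value the reducer actually needs crosses the wire.
  public static class HostBytesMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text host = new Text();
    private final LongWritable bytes = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: whitespace-separated fields, host in column 0,
      // byte count in column 1. Adjust the parsing to the real log format.
      String[] fields = value.toString().split("\\s+");
      if (fields.length < 2) {
        return; // skip malformed lines
      }
      try {
        host.set(fields[0]);
        bytes.set(Long.parseLong(fields[1]));
        context.write(host, bytes);
      } catch (NumberFormatException e) {
        // skip lines whose byte field isn't numeric
      }
    }
  }

  // Sums byte counts per host; also used as the combiner so the ~65000
  // keys are pre-aggregated on the map side before the shuffle.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep reducers idle until most maps are done, per the advice above.
    conf.set("mapred.reduce.slowstart.completed.maps", "0.95");

    Job job = new Job(conf, "bytes per host");
    job.setJarByClass(HostBytes.class);
    job.setMapperClass(HostBytesMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}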
