Here's another thought. I realized that the reduce operation in my map/reduce jobs is a flash once the mappers are done, but it crawls along until they finish. Is there a way to configure the cluster to make the reduce wait for the map operations to complete? Especially considering my hardware constraints.

Thanks!
Pony
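A minimal sketch of the knob that governs this, assuming the Hadoop 0.20.x-era property name and the old mapred API (the helper class itself is hypothetical; only the property is standard): by default reducers are scheduled once just 5% of the maps have completed, so the shuffle can overlap with the map phase, and on a single-core, 600MB node that early reducer mostly competes with the maps for the only core. Raising the threshold to 1.0 holds every reducer back until the maps are done.

import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper; only the slowstart property itself is standard.
public class SlowstartExample {
    public static JobConf delayReducers(JobConf conf) {
        // Fraction of map tasks that must finish before any reducer is
        // launched. The default of 0.05 starts reducers almost immediately,
        // where they occupy a task slot just waiting on map output.
        // 1.0 means "launch no reducer until every map has completed".
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
        return conf;
    }
}

The same value can also be set cluster-wide in mapred-site.xml, as a <property> block like the mapred.tasktracker.map.tasks.maximum one quoted further down.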
On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:

> Hey guys,
> Thanks to all of you for your help.
>
> Joey,
> I tweaked my MapReduce to serialize/deserialize only essential values and
> added a combiner, and that helped a lot. Previously I had a domain object
> which was being passed between Mapper and Reducer when I only needed a
> single value (there's a sketch of this change after the quoted thread
> below).
>
> Esteban,
> I think you underestimate the constraints of my cluster. Running multiple
> tasks per JVM really kills me in terms of memory. Not to mention that with
> a single core there's not much to gain in terms of parallelism (other than
> perhaps while a process is waiting on an I/O operation). Still, I gave it
> a shot, but even though I kept changing the config I always ended up with
> a Java heap space error.
>
> Is it me, or is performance tuning mostly a per-job task? I mean it will,
> in the end, depend on the data you are processing (structure, size,
> whether it's in one file or many, etc.). If my jobs have different sets of
> data, which are in different formats and organized in different file
> structures, do you guys recommend moving some of the configuration to Java
> code?
>
> Thanks!
> Pony
>
> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>
>> Are you the Esteban I know?
>>
>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>> wrote:
>>
>> > Hi Pony,
>> >
>> > There is a good chance that your boxes are doing some heavy swapping,
>> > and that is a killer for Hadoop. Have you tried
>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes
>> > as much as possible?
>> >
>> > Cheers,
>> > Esteban.
>> >
>> > --
>> > Get Hadoop! http://www.cloudera.com/downloads/
>> >
>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>> >
>> >> Hi guys!
>> >>
>> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes
>> >> exactly alike: single-core machines with 600MB of RAM. No chance of
>> >> upgrading the hardware.
>> >>
>> >> My cluster is made up of 1 NameNode/JobTracker box and 19
>> >> DataNode/TaskTracker boxes.
>> >>
>> >> All my config is default, except I've set the following in my
>> >> mapred-site.xml in an effort to prevent choking my boxes:
>> >>
>> >> <property>
>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>   <value>1</value>
>> >> </property>
>> >>
>> >> I'm running a MapReduce job which reads a proxy server log file (2GB),
>> >> maps a host to each record, and then in the reduce task accumulates
>> >> the number of bytes received from each host.
>> >>
>> >> Currently it's producing about 65000 keys.
>> >>
>> >> The whole job takes forever to complete, especially the reduce part.
>> >> I've tried different tuning configs but I can't bring it down under 20
>> >> minutes.
>> >>
>> >> Any ideas?
>> >>
>> >> Thanks for your help!
>> >> Pony
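For reference, a hedged reconstruction of the change described in the Jul 8 mail above. The class name and the Text/LongWritable types are assumptions; only the pattern (emit a bare counter instead of a domain object, reuse the reducer as a combiner) comes from the thread. Because summing bytes per host is associative and commutative, the reduce class can double as the combiner, so each mapper ships at most one partial sum per host instead of one record per log line.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the byte counts emitted by the mapper for each host. Emitting a
// plain LongWritable keeps the map output (and therefore the shuffle) small.
public class BytesPerHostReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text host, Iterable<LongWritable> byteCounts,
                          Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable count : byteCounts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(host, total);
    }

    // The combiner pre-aggregates on the map side; with about 65000
    // distinct hosts, this caps what each mapper sends over the network.
    public static void configure(Job job) {
        job.setCombinerClass(BytesPerHostReducer.class);
        job.setReducerClass(BytesPerHostReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
    }
}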
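Esteban's JVM-reuse suggestion, sketched the same way (setNumTasksToExecutePerJvm is the old mapred API's setter for mapred.job.reuse.jvm.num.tasks; the -Xmx value shown is just Hadoop's default child heap, used here as an example of capping it):

import org.apache.hadoop.mapred.JobConf;

// Reuse one JVM for an unlimited number of this job's tasks on a node,
// trading repeated JVM startup cost for a longer-lived heap. As the thread
// notes, on 600MB boxes the heap must be capped aggressively or the node
// starts swapping.
public class JvmReuseExample {
    public static void reuseJvm(JobConf conf) {
        conf.setNumTasksToExecutePerJvm(-1); // mapred.job.reuse.jvm.num.tasks=-1
        conf.set("mapred.child.java.opts", "-Xmx200m"); // cap per-task heap
    }
}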
