Here's another thought. I realized that the reduce operation in my
map/reduce jobs is a flash once the mappers end, but it runs really slowly
until then. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.
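
I came across mapred.reduce.slowstart.completed.maps, which (if I'm reading
the docs right) controls the fraction of maps that must finish before the
reducers are launched. Would something like this in mapred-site.xml be the
right approach? The value here is just my guess:

  <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>1.00</value>
  </property>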

Thanks!
Pony

On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <[email protected]> wrote:

> Hey guys,
> Thanks all of you for your help.
>
> Joey,
> I tweaked my MapReduce job to serialize/deserialize only essential values
> and added a combiner, and that helped a lot. Previously I had a domain
> object being passed between the Mapper and Reducer when I only needed a
> single value.
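>
> Roughly, what I ended up with looks like this (a simplified sketch; the
> real class name is different). Since the reduce is a plain sum, the same
> class now doubles as the combiner via job.setCombinerClass():
>
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Reducer;
>
> // Sums bytes per host. Addition is associative, so the same class works
> // as both combiner and reducer, and the map output is a plain
> // LongWritable instead of the old domain object.
> public class BytesSumReducer
>     extends Reducer<Text, LongWritable, Text, LongWritable> {
>   @Override
>   protected void reduce(Text host, Iterable<LongWritable> values,
>       Context context) throws IOException, InterruptedException {
>     long sum = 0;
>     for (LongWritable value : values) {
>       sum += value.get();
>     }
>     context.write(host, new LongWritable(sum));
>   }
> }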
>
> Esteban,
> I think you underestimate the constraints of my cluster. Running multiple
> tasks per JVM really kills me in terms of memory. Not to mention that with
> a single core there's not much to gain in terms of parallelism (other than
> perhaps while a process is waiting on an I/O operation). Still, I gave it a
> shot, but no matter how I changed the config I always ended up with a Java
> heap space error.
>
> Is it just me, or is performance tuning mostly a per-job task? I mean, in
> the end it will depend on the data you are processing (structure, size,
> whether it's in one file or many, etc.). Given that my jobs have different
> sets of data, in different formats and organized in different file
> structures, do you guys recommend moving some of the configuration to Java
> code?
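>
> For example, something like this per-job driver instead of cluster-wide
> XML (a sketch; the class name and property values are just guesses for my
> boxes):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
>
> public class ProxyLogDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     conf.setInt("mapred.reduce.tasks", 4);               // reducer count for this job only
>     conf.setInt("io.sort.mb", 50);                       // smaller sort buffer for 600MB boxes
>     conf.setBoolean("mapred.compress.map.output", true); // cut shuffle traffic
>     Job job = new Job(conf, "proxy-log-bytes-per-host");
>     // ... setMapperClass/setCombinerClass/setReducerClass and the
>     // input/output paths would go here, as usual ...
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }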
>
> Thanks!
> Pony
>
> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <[email protected]> wrote:
>
>> Are you the Esteban I know?
>>
>>
>>
>> On 07/07/2011, at 15:53, Esteban Gutierrez <[email protected]>
>> wrote:
>>
>> > Hi Pony,
>> >
>> > There is a good chance that your boxes are doing some heavy swapping,
>> > and that is a killer for Hadoop. Have you tried
>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes
>> > as much as possible?
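>> >
>> > Something like this in mapred-site.xml, for instance (the -Xmx value is
>> > just an example for 600MB boxes; leave room for the OS and the daemons):
>> >
>> >   <property>
>> >       <name>mapred.job.reuse.jvm.num.tasks</name>
>> >       <value>-1</value>
>> >   </property>
>> >   <property>
>> >       <name>mapred.child.java.opts</name>
>> >       <value>-Xmx200m</value>
>> >   </property>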
>> >
>> > Cheers,
>> > Esteban.
>> >
>> > --
>> > Get Hadoop!  http://www.cloudera.com/downloads/
>> >
>> >
>> >
>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <[email protected]> wrote:
>> >
>> >> Hi guys!
>> >>
>> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes,
>> >> exactly alike: single-core machines with 600MB of RAM. No chance of
>> >> upgrading the hardware.
>> >>
>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
>> >> DataNode/TaskTracker boxes.
>> >>
>> >> All my config is default, except that I've set the following in my
>> >> mapred-site.xml in an effort to avoid choking my boxes:
>> >>
>> >>   <property>
>> >>       <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>       <value>1</value>
>> >>   </property>
>> >>
>> >> I'm running a MapReduce job which reads a proxy server log file (2GB),
>> >> maps each record to a host, and then in the reduce task accumulates the
>> >> number of bytes received from each host.
>> >>
>> >> Currently it's producing about 65,000 keys.
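>> >>
>> >> In case it helps, the mapper is essentially this (a simplified sketch;
>> >> the class name and field positions are placeholders, my real log format
>> >> differs):
>> >>
>> >> import java.io.IOException;
>> >> import org.apache.hadoop.io.LongWritable;
>> >> import org.apache.hadoop.io.Text;
>> >> import org.apache.hadoop.mapreduce.Mapper;
>> >>
>> >> // Maps each log line to (host, bytes received).
>> >> public class HostBytesMapper
>> >>     extends Mapper<LongWritable, Text, Text, LongWritable> {
>> >>   private final Text host = new Text();
>> >>   private final LongWritable bytes = new LongWritable();
>> >>
>> >>   @Override
>> >>   protected void map(LongWritable offset, Text line, Context context)
>> >>       throws IOException, InterruptedException {
>> >>     String[] fields = line.toString().split("\\s+");
>> >>     if (fields.length < 5) {
>> >>       return; // skip short or garbled lines
>> >>     }
>> >>     try {
>> >>       host.set(fields[2]);                  // host column (placeholder index)
>> >>       bytes.set(Long.parseLong(fields[4])); // bytes column (placeholder index)
>> >>       context.write(host, bytes);
>> >>     } catch (NumberFormatException e) {
>> >>       // skip records with a non-numeric bytes field
>> >>     }
>> >>   }
>> >> }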
>> >>
>> >> The whole job takes forever to complete, especially the reduce part.
>> >> I've tried different tuning configs but I can't bring it under 20
>> >> minutes.
>> >>
>> >> Any ideas?
>> >>
>> >> Thanks for your help!
>> >> Pony
>> >>
>>
>
>
