I simply followed the wiki's guidance that "The right level of parallelism
for maps seems to be around 10-100 maps/node":
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces

We have 8 cores in each machine, so perhaps 100 mappers ought to be right.
It's set to 157 in the config, but Hadoop used ~200 for the job; I don't
know why. Fewer tasks would of course help in this case, but what about
when we process large datasets, especially if a mapper fails?
I also set up the reducers at roughly one per core, slightly fewer.
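(A minimal sketch, not from the original mail, of how those counts get
passed in through JobConf; the class name and the exact figures are only
illustrative. As far as I know, setNumMapTasks() is only a hint, since the
InputFormat's splits decide the real number of maps, which would explain a
configured 157 turning into ~200 tasks; setNumReduceTasks() on the other
hand is taken literally.)

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountSketch.class);

        // Hint only: the framework derives the actual map count from the
        // input splits, so 157 here can still come out as ~200 map tasks.
        conf.setNumMapTasks(157);

        // Exact: precisely this many reduce tasks are launched, e.g. a bit
        // under one per core on a 10-node cluster with 8 cores per node.
        conf.setNumReduceTasks(70);
    }
}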
/Johan
Ted Dunning wrote:
Why so many mappers and reducers relative to the number of machines you
have? This just causes excess heartache when running the job.
My standard practice is to run with a small factor more tasks than the
number of cores I have (for instance, 3 tasks on a 2-core machine). In
fact, I find it most helpful to let the cluster defaults rule the choice,
except in a few cases where I want one reducer or a few more than the
standard 4 reducers.
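(One possible reading of the advice above, sketched out; the helper name,
the idea of reading the cluster's advertised slot capacity, and the "one
wave of reduces" choice are assumptions on my part, not something Ted
stated.)

import java.io.IOException;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterScaledTasks {
    public static void sizeFromCluster(JobConf conf) throws IOException {
        ClusterStatus cluster = new JobClient(conf).getClusterStatus();

        // Total map and reduce slots across the cluster; admins usually
        // set these to a small multiple of the cores on each machine.
        int mapSlots = cluster.getMaxMapTasks();
        int reduceSlots = cluster.getMaxReduceTasks();

        conf.setNumMapTasks(mapSlots);        // still only a hint to the framework
        conf.setNumReduceTasks(reduceSlots);  // roughly one wave of reduces
    }
}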
On 1/15/08 9:15 AM, "Johan Oskarsson" <[EMAIL PROTECTED]> wrote:
Hi.
I believe someone posted about this a while back, but it's worth
mentioning again.
I just ran a job on our 10-node cluster where the input data was
~70 empty sequence files. With our default settings this ran ~200
mappers and ~70 reducers.
The job took almost exactly two minutes to finish.
How can we reduce this overhead?
* Pick the number of mappers and reducers more dynamically,
depending on the size of the input? (a rough sketch of this idea appears below)
* JVM reuse: one JVM per job instead of one per task?
Any other ideas?
/Johan
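(Not from the thread either, but a rough sketch of the first idea above,
picking the reduce count from the input size; the bytes-per-reducer target
and the class name are made up, and it assumes a reasonably recent
FileSystem API.)

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class DynamicReducerCount {
    // Made-up target: roughly one reducer per gigabyte of input.
    private static final long BYTES_PER_REDUCER = 1L << 30;

    public static int chooseReducers(JobConf conf, Path inputDir) throws IOException {
        FileSystem fs = inputDir.getFileSystem(conf);
        long totalBytes = fs.getContentSummary(inputDir).getLength();

        // Scale the reduce count with the input size; for ~70 empty
        // sequence files this collapses to a single reducer instead of ~70.
        int reducers = (int) Math.max(1L, totalBytes / BYTES_PER_REDUCER);
        conf.setNumReduceTasks(reducers);
        return reducers;
    }
}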