W.P.,

Sounds like you are going to be taking a long time no matter what.  With a 
keyspace of about 10^8 keys, either Hadoop is eventually going to allocate 10^8 
reducers (if you set your reducer count to 10^8), or it is going to re-use the 
ones you have 10^8 / (number of reducers you allocate) times.  It is probably 
just a big job :)

Look into the Fair Scheduler, or specify fewer reducers for this job and accept 
a slight slowdown, but allow other jobs to get reducers when they need them.
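
For example, here is a minimal sketch of doing both from a job driver.  It 
assumes the classic (pre-YARN) Fair Scheduler and the new mapreduce Job API; 
the pool name, class name, and reducer count are made up for illustration:

// Sketch: cap the reducer count for this job and tag it with a Fair Scheduler
// pool so other jobs can still get reduce slots.  Pool name and class name are
// placeholders, not from this thread.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BigKeyspaceDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The classic Fair Scheduler picks up the pool from this property.
    conf.set("mapred.fairscheduler.pool", "adhoc");

    Job job = new Job(conf, "big-keyspace-job");
    // Fewer reduce tasks: each reducer gets re-used for more keys and the job
    // runs a bit longer, but the cluster keeps slots free for other work.
    job.setNumReduceTasks(50);

    // ... set mapper, reducer, input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}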

You *might* get some efficiencies if you can reduce the number of keys, or 
ensure that very few keys are getting big lists of values (that kind of skew is 
anti-parallel).  Make sure you are using a combiner if there is an opportunity 
to reduce the amount of data that goes through the shuffle.  That is always a 
good thing; IO = slow.
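
If your reduce operation is associative and commutative (a sum, count, max, 
etc.), the reducer itself can usually double as the combiner.  A rough sketch; 
the class name and value types are made up, not taken from your job:

// Sketch: a sum-style reducer re-used as a combiner, so partial sums are
// merged on the map side and less data crosses the shuffle.  Only safe when
// the reduce operation is associative and commutative.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    ctx.write(key, new LongWritable(sum));
  }

  // In the driver:
  // job.setCombinerClass(SumReducer.class);   // map-side pre-aggregation
  // job.setReducerClass(SumReducer.class);    // final aggregation
}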

Also, see if you can break your job up into smaller pieces so the more 
expensive operations happen over a smaller volume of data.
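
One way to do that (just a sketch, with placeholder paths and job names) is to 
chain two jobs, letting a cheap first pass shrink the data before the expensive 
pass runs:

// Sketch: split the work into two chained jobs so the expensive reduce only
// sees pre-aggregated data.  Paths and job names are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Stage 1: cheap pass that shrinks the data (filter / pre-aggregate).
    Job stage1 = new Job(conf, "stage-1-preaggregate");
    FileInputFormat.addInputPath(stage1, new Path("/data/raw"));
    FileOutputFormat.setOutputPath(stage1, new Path("/data/stage1"));
    // ... set mapper, reducer, combiner for stage 1 ...
    if (!stage1.waitForCompletion(true)) System.exit(1);

    // Stage 2: the expensive operation, now over much less data.
    Job stage2 = new Job(conf, "stage-2-expensive");
    FileInputFormat.addInputPath(stage2, new Path("/data/stage1"));
    FileOutputFormat.setOutputPath(stage2, new Path("/data/final"));
    // ... set mapper and reducer for stage 2 ...
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}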

Good luck!

Cheers
James.


On 2011-05-18, at 3:42 PM, W.P. McNeill wrote:

> Altogether my reducers are handling about 10^8 keys.  The number of values
> per key varies, but ranges from 1-100.  I'd guess the mean and mode is
> around 10, but I'm not sure.
