W.P.,
Sounds like you are going to be looking at a long run time no matter what. With a
keyspace of about 10^8 keys, Hadoop is either going to allocate 10^8 reducers
(if you set your reducer count that high) or re-use the ones you have, so each
reduce task ends up handling roughly 10^8 / (number of reducers you allocate)
keys. It is probably just a big job :)
Look into the FairScheduler, or specify fewer reducers for this job: you suffer a
slight slowdown, but other jobs can get reducers when they need them.
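For what it's worth, the reducer count is just a per-job setting. A minimal
sketch of the driver side (all the class names here are placeholders, not your
actual job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = new Job(conf, "big-reduce-job");
    job.setJarByClass(MyDriver.class);      // placeholder driver class
    job.setMapperClass(MyMapper.class);     // placeholder mapper
    job.setReducerClass(MyReducer.class);   // placeholder reducer

    // Cap the reduce tasks for this one job so other jobs can still get
    // slots; the trade-off is that each reducer handles more keys.
    job.setNumReduceTasks(32);

If your driver goes through ToolRunner / GenericOptionsParser you can also set
it from the command line with -D mapred.reduce.tasks=32.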
You *might* get some efficiencies if you can reduce the number of keys, or
ensure that very few keys end up with big lists of values (that kind of skew is
anti-parallel). Make sure you are using a combiner if there is any opportunity
to cut the amount of data that goes through the shuffle; that is always a good
thing, since IO = slow.
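If your reduce logic is associative and commutative, e.g. summing counts, you
can usually re-use the reducer class itself as the combiner. A rough sketch
(the names are made up), assuming a sum-style reduce:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Safe to use as a combiner because partial sums of partial sums
    // still give the right final total.
    public class SumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values,
                            Context context)
          throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
          total += value.get();
        }
        context.write(key, new LongWritable(total));
      }
    }

    // In the driver:
    job.setReducerClass(SumReducer.class);
    job.setCombinerClass(SumReducer.class);  // cuts shuffle volume on the map side

Keep in mind the combiner is only a hint: Hadoop may run it zero or more times
on any map output, so it has to give correct results no matter how many times
it is applied.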
Also, see if you can break your job up into smaller pieces so that the more
expensive operations run over a smaller volume of data.
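One shape that can take, just as a sketch (paths and class names are
hypothetical, and it assumes the same conf object plus the usual
org.apache.hadoop.fs.Path and mapreduce.lib.input/output FileInputFormat /
FileOutputFormat imports), is chaining two jobs: a cheap first pass that
filters or pre-aggregates, then the expensive pass over the much smaller
intermediate output:

    // Pass 1: cheap filter / pre-aggregation, writes a much smaller data set.
    Job pass1 = new Job(conf, "pass1-preaggregate");
    pass1.setMapperClass(PreAggregateMapper.class);    // hypothetical
    pass1.setReducerClass(PreAggregateReducer.class);  // hypothetical
    FileInputFormat.addInputPath(pass1, new Path("/data/raw"));
    FileOutputFormat.setOutputPath(pass1, new Path("/data/intermediate"));
    if (!pass1.waitForCompletion(true)) {
      System.exit(1);  // don't run the expensive pass if pass 1 failed
    }

    // Pass 2: the expensive operation, now over far fewer records.
    Job pass2 = new Job(conf, "pass2-expensive");
    pass2.setMapperClass(ExpensiveMapper.class);       // hypothetical
    pass2.setReducerClass(ExpensiveReducer.class);     // hypothetical
    FileInputFormat.addInputPath(pass2, new Path("/data/intermediate"));
    FileOutputFormat.setOutputPath(pass2, new Path("/data/final"));
    pass2.waitForCompletion(true);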
Good luck!
Cheers
James.
On 2011-05-18, at 3:42 PM, W.P. McNeill wrote:
> Altogether my reducers are handling about 10^8 keys. The number of values
> per key varies, but ranges from 1 to 100. I'd guess the mean and mode are
> around 10, but I'm not sure.