Hello,

On Fri, Mar 25, 2011 at 12:48 PM, Keith Wiley <[email protected]> wrote:
> Say my mappers produce at most (or precisely) 4 output keys.  Say I designate 
> the job to have at least (or precisely) 4 reducers.  I have noticed that it 
> is not guaranteed that all four reducers will be used, one per key.  Rather, 
> it is entirely likely that one reducer won't be used at all and another will 
> receive two sets of keys, first receiving all values of one key, then all 
> values of the other key.

This is merely the side effect of using a hash-based partitioner when
your partitioning can be manually defined. Using a hash-partitioner,
by its nature of using hashes, cannot guarantee equality among
partitioners (but are very good to use for randomly distributed keys
most of the times).

> I have been told that the only way to achieve a more ideal distribution of 
> work is to write my own partitioner.  I'm willing to do that, we've done it 
> before within our group on this project, but I don't want to do any 
> unnecessary work.  I'm mildly surprised that there isn't a configuration 
> setting that will achieve my desired goal here.  Was the advice I received 
> correct?  Can my goal only be achieved by writing a fresh partitioner from 
> scratch?

I think the advice was very correct.

If you ask me, deciding a Partitioner is as much important in a job as
anything else. Though HashPartitioner is provided as a (mostly) good
default, it is worth spending time and thought on a partitioner that
better fits your job process or the data being processed. I suppose it
could help if Hadoop provided something like what you seek (a
concretely defined set of keys mapping directly to their partitions),
but it is pretty trivial to write one, and would also be difficult to
generalize among all the types that get used as keys. A neat example
would be the TotalOrderPartitioner used for terasort; they had to use
it for getting a globally sorted output data - but not all forms of
jobs otherwise run for ETL/etc. would require or benefit from such a
thing. This is another reason why Partitioners are pluggable (to
encourage you to design your own, if need be).

-- 
Harsh J
http://harshj.com

Reply via email to