Hello,

On Fri, Mar 25, 2011 at 12:48 PM, Keith Wiley <[email protected]> wrote:
> Say my mappers produce at most (or precisely) 4 output keys. Say I designate
> the job to have at least (or precisely) 4 reducers. I have noticed that it
> is not guaranteed that all four reducers will be used, one per key. Rather,
> it is entirely likely that one reducer won't be used at all and another will
> receive two sets of keys, first receiving all values of one key, then all
> values of the other key.

This is merely a side effect of using a hash-based partitioner when your
partitioning could be defined manually. A hash partitioner, by its very
nature, cannot guarantee an even distribution of keys across partitions
(though it works very well for randomly distributed keys most of the time).

> I have been told that the only way to achieve a more ideal distribution of
> work is to write my own partitioner. I'm willing to do that, we've done it
> before within our group on this project, but I don't want to do any
> unnecessary work. I'm mildly surprised that there isn't a configuration
> setting that will achieve my desired goal here. Was the advice I received
> correct? Can my goal only be achieved by writing a fresh partitioner from
> scratch?

I think the advice was correct. If you ask me, choosing a Partitioner is
as important to a job as anything else. Though HashPartitioner is provided
as a (mostly) good default, it is worth spending time and thought on a
partitioner that better fits your job or the data being processed.

I suppose it could help if Hadoop provided something like what you seek (a
concretely defined set of keys mapping directly to their partitions), but
such a partitioner is pretty trivial to write, and it would be difficult to
generalize across all the types that get used as keys.

A neat example is the TotalOrderPartitioner used for terasort: it was needed
there to get globally sorted output, but not every job run for ETL and the
like would require or benefit from such a thing. This is another reason why
Partitioners are pluggable (to encourage you to design your own, if need be).

-- 
Harsh J
http://harshj.com
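
P.S. To make the above concrete, here is a minimal sketch in plain Java (no
Hadoop dependency, so it runs on its own). The hashPartition method uses the
same formula as HashPartitioner, (hashCode & Integer.MAX_VALUE) % numPartitions.
The class name, the String keys "a", "b", "c", "e", and the fixed key list are
my own assumptions for illustration; real Writable keys such as Text hash
differently, so the exact collisions will differ in a real job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: plain Java, not tied to the Hadoop classpath.
final class ExplicitKeyPartitionerSketch {

    // Same arithmetic that HashPartitioner applies to key.hashCode().
    static int hashPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical fixed mapping: each known key gets its own partition,
    // numbered in the order the keys are listed.
    private final Map<String, Integer> partitionOf = new LinkedHashMap<>();

    ExplicitKeyPartitionerSketch(String... knownKeys) {
        for (String key : knownKeys) {
            partitionOf.putIfAbsent(key, partitionOf.size());
        }
    }

    // Mirrors the shape of Partitioner.getPartition(key, value, numPartitions),
    // minus the value argument, which this sketch does not need.
    int getPartition(String key, int numPartitions) {
        Integer assigned = partitionOf.get(key);
        if (assigned == null) {
            // Unknown key: fall back to hashing rather than failing.
            return hashPartition(key, numPartitions);
        }
        // Modulo keeps the result valid if fewer reducers are configured.
        return assigned % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c", "e"};
        ExplicitKeyPartitionerSketch explicit = new ExplicitKeyPartitionerSketch(keys);
        for (String key : keys) {
            System.out.println(key
                    + ": hash -> " + hashPartition(key, 4)
                    + ", explicit -> " + explicit.getPartition(key, 4));
        }
        // With these String keys, hashing gives a=1, b=2, c=3, e=1:
        // partition 0 stays idle and partition 1 receives two keys.
        // The explicit mapping gives a=0, b=1, c=2, e=3.
    }
}
```

In an actual job the same logic would go into a class extending
org.apache.hadoop.mapreduce.Partitioner and be registered with
job.setPartitionerClass(...); how the list of known keys reaches the
partitioner (hard-coded, or read from the job configuration) is left open here.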
