Jane, i think you have mapred.tasktracker.reduce.tasks.maximum or
mapred.reduce.tasks set to 1 in your local, and have them set to some other
values in the emr, that's why you always get one reducer in your local and not
on the emr. CheersRamon
> Date: Thu, 8 Mar 2012 21:30:26 -0500
> Subject: does hadoop always respect setNumReduceTasks?
> From: jane.wayne2...@gmail.com
> To: common-user@hadoop.apache.org
>
> i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
>
> as i am emitting items from the mapper, i expect/desire only 1 reducer to
> get these items because i want to assign each key of the key-value input
> pair a unique integer id. if i had 1 reducer, i can just keep a local
> counter (with respect to the reducer instance) and increment it.
>
> on my local hadoop cluster, i noticed that most, if not all, my jobs have
> only 1 reducer, regardless of whether or not i set
> Job.setNumReduceTasks(int).
>
> however, as soon as i moved the code unto amazon's elastic mapreduce (emr),
> i notice that there are multiple reducers. if i set the number of reduce
> tasks to 1, is this always guaranteed? i ask because i don't know if there
> is a gotcha like the combiner (where it may or may not run at all).
>
> also, it looks like this might not be a good idea just having 1 reducer (it
> won't scale). it is most likely better if there are +1 reducers, but in
> that case, i lose the ability to assign unique numbers to the key-value
> pairs coming in. is there a design pattern out there that addresses this
> issue?
>
> my mapper/reducer key-value pair signatures looks something like the
> following.
>
> mapper(Text, Text, Text, IntWritable)
> reducer(Text, IntWritable, IntWritable, Text)
>
> the mapper reads a sequence file whose key-value pairs are of type Text and
> Text. i then emit Text (let's say a word) and IntWritable (let's say
> frequency of the word).
>
> the reducer gets the word and its frequencies, and then assigns the word an
> integer id. it emits IntWritable (the id) and Text (the word).
>
> i remember seeing code from mahout's API where they assign integer ids to
> items. the items were already given an id of type long. the conversion they
> make is as follows.
>
> public static int idToIndex(long id) {
> return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> }
>
> is there something equivalent for Text or a "word"? i was thinking about
> simply taking the hash value of the string/word, but of course, different
> strings can map to the same hash value.