Here's a consequence I see of having values that are much larger than the keys: there's not much point in adding a combiner.
My mapper emits pairs of the form <Key, Value>, where the size of a Value is much greater than the size of a Key. The reducer then processes input of the form <Key, Iterator<Value>>: it looks at the set of values corresponding to a key and sorts each value into one of two bins. I don't think this is particularly CPU-intensive; however, the reducer needs access to the entire set of values. The set can't be boiled down into a smaller sufficient statistic the way, say, a word-count program can combine the counts for a word from different documents into a single number.

As a result, the only combiner strategy I can see is to have the mapper emit each Value as a single-item list, <Key, [Value]>, have a combiner concatenate the lists into <Key, [Value, Value...]>, and then have the reducer work on lists of lists: <Key, Iterator<[Value, Value...]>>. This would save on redundant Key I/O, but since Values are so much bigger than Keys, I don't think the saving would matter.
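To make the list-based scheme concrete, here is a minimal sketch in plain Java, with no Hadoop dependencies. `ListCombinerSketch` and its String key/value types are illustrative assumptions, not part of any real job; the three methods just model the map, combine, and reduce steps described above.

```java
import java.util.*;

public class ListCombinerSketch {

    // Mapper side: wrap each value as a singleton list, i.e. emit <Key, [Value]>.
    static List<Map.Entry<String, List<String>>> mapPhase(List<String[]> records) {
        List<Map.Entry<String, List<String>>> out = new ArrayList<>();
        for (String[] kv : records) {
            out.add(Map.entry(kv[0], List.of(kv[1])));
        }
        return out;
    }

    // Combiner: concatenate the lists emitted for the same key on one node,
    // producing <Key, [Value, Value...]>. Only the repeated keys are collapsed;
    // every value still travels to the reducer.
    static Map<String, List<String>> combine(List<Map.Entry<String, List<String>>> pairs) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : pairs) {
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
        }
        return merged;
    }

    // Reducer: flatten the lists of lists arriving from all combiners,
    // recovering the full set of values per key.
    static Map<String, List<String>> reduce(List<Map<String, List<String>>> combinerOutputs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map<String, List<String>> partial : combinerOutputs) {
            for (Map.Entry<String, List<String>> e : partial.entrySet()) {
                result.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return result;
    }
}
```

Note what the combiner does and doesn't buy here: it removes the duplicate copies of each key in the shuffle, but the payload shipped to the reducer is still every single value. When values dwarf keys, the bytes saved are a rounding error, which is the point of the argument above.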
