On Oct 1, 2007, at 6:05 PM, Chris Dyer wrote:
> As of right now, I'm still having trouble determining how I can force
> the first element of the set that will be iterated over by a single
> reducer to be the marginal, and not some individual count. Does
> anyone know if Hadoop guarantees (can be made to guarantee) that the
> relative order of keys that are equal will be left unchanged? If so,
> this would be a fairly easy solution.
There is no guarantee that the reduce sort is stable in any
sense. (With the non-deterministic order in which the map outputs become
available to the reduce, such a guarantee wouldn't make much sense.)
There certainly isn't enough documentation about what is allowed for
sorting. I've filed a bug (HADOOP-1981) to expand the Reducer javadoc
to mention the JobConf methods that can control the sort order. In
particular, the methods are:
setOutputKeyComparatorClass
setOutputValueGroupingComparator
The first comparator controls the sort order of the keys. The second
controls which keys are grouped together into a single call to the
reduce method. The combination of these two allows you to set up jobs
that act like you've defined an order on the values.
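For reference, the wiring on a JobConf looks roughly like this (MyJob,
MyKeyComparator, and MyGroupingComparator are just placeholder names for
classes you would write yourself, not anything shipped with Hadoop):

  // inside your job setup code
  JobConf conf = new JobConf(MyJob.class);
  // full sort order of the map output keys
  conf.setOutputKeyComparatorClass(MyKeyComparator.class);
  // which adjacent (already sorted) keys share a single call to reduce()
  conf.setOutputValueGroupingComparator(MyGroupingComparator.class);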
For example, say that you want to find duplicate web pages and tag
them all with the url of the "best" known example. You would set up
the job like this:
Map Input Key: url
Map Input Value: document
Map Output Key: document checksum, url pagerank
Map Output Value: url
Partitioner: by checksum
OutputKeyComparator: by checksum and then decreasing pagerank
OutputValueGroupingComparator: by checksum
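A rough sketch of those pieces in the mapred API might look like the
following (DupKey is a hypothetical composite key holding the checksum
and pagerank; it is not something Hadoop provides, and each top-level
class would live in its own source file):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Hypothetical composite map output key: document checksum plus pagerank.
  public class DupKey implements WritableComparable<DupKey> {
    private Text checksum = new Text();
    private double pagerank;
    public Text getChecksum() { return checksum; }
    public double getPagerank() { return pagerank; }
    public void write(DataOutput out) throws IOException {
      checksum.write(out);
      out.writeDouble(pagerank);
    }
    public void readFields(DataInput in) throws IOException {
      checksum.readFields(in);
      pagerank = in.readDouble();
    }
    // checksum ascending, then pagerank descending
    public int compareTo(DupKey o) {
      int cmp = checksum.compareTo(o.checksum);
      return cmp != 0 ? cmp : -Double.compare(pagerank, o.pagerank);
    }
  }

  // OutputKeyComparator: checksum, then decreasing pagerank, so the
  // highest-ranked url sorts first within each checksum.
  public class KeyComparator extends WritableComparator {
    public KeyComparator() { super(DupKey.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((DupKey) a).compareTo((DupKey) b);
    }
  }

  // OutputValueGroupingComparator: checksum only, so all keys with the
  // same checksum are fed to a single call to reduce().
  public class GroupingComparator extends WritableComparator {
    public GroupingComparator() { super(DupKey.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((DupKey) a).getChecksum().compareTo(((DupKey) b).getChecksum());
    }
  }

  // Partitioner: checksum only, so every record for a given checksum
  // reaches the same reduce regardless of its pagerank.
  public class ChecksumPartitioner implements Partitioner<DupKey, Text> {
    public void configure(JobConf job) { }
    public int getPartition(DupKey key, Text url, int numPartitions) {
      return (key.getChecksum().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

These are registered on the JobConf with setPartitionerClass,
setOutputKeyComparatorClass, and setOutputValueGroupingComparator, as
above.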
With this setup, the reduce function will be called exactly once per
checksum, and the first value from the iterator will be the url with
the highest pagerank, which can then be used to tag the other entries
of the checksum family.
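On the reduce side, all you have to do is remember that first value; a
sketch (again using the hypothetical DupKey from the example above):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Called once per checksum; the first url out of the iterator is the
  // one with the highest pagerank because of the key sort order.
  public class TagDuplicates extends MapReduceBase
      implements Reducer<DupKey, Text, Text, Text> {
    public void reduce(DupKey key, Iterator<Text> urls,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Text best = new Text(urls.next());           // "best" known example
      while (urls.hasNext()) {
        out.collect(new Text(urls.next()), best);  // tag each duplicate with it
      }
    }
  }

(The copies with new Text(...) matter because the framework reuses the
value object between iterations.)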
-- Owen