On Oct 1, 2007, at 6:05 PM, Chris Dyer wrote:
> As of right now, I'm still having trouble determining how I can force
> the first element of the set that will be iterated over by a single
> reducer to be the marginal, and not some individual count. Does
> anyone know if Hadoop guarantees (can be made to guarantee) that the
> relative order of keys that are equal will be left unchanged? If so,
> this would be a fairly easy solution.
There is no guarantee that the reduce sort is stable in any
sense. (With the non-deterministic order in which the map outputs become
available to the reduce, such a guarantee wouldn't make much sense.)
There certainly isn't enough documentation about what is allowed for
sorting. I've filed a bug (HADOOP-1981) to expand the Reducer javadoc
to mention the JobConf methods that can control the sort order. In
particular, the methods are:
setOutputKeyComparatorClass
setOutputValueGroupingComparator
The first comparator controls the sort order of the keys. The second
controls which keys are grouped together into a single call to the
reduce method. The combination of these two allows you to set up jobs
that act like you've defined an order on the values.
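For reference, the wiring on a JobConf looks roughly like this (MyJob,
MyKeyComparator, and MyGroupingComparator are just placeholder names for
classes you would write yourself, not anything shipped with Hadoop):

  // inside your job setup code
  JobConf conf = new JobConf(MyJob.class);
  // full sort order of the map output keys
  conf.setOutputKeyComparatorClass(MyKeyComparator.class);
  // which adjacent (already sorted) keys share a single call to reduce()
  conf.setOutputValueGroupingComparator(MyGroupingComparator.class);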
For example, say that you want to find duplicate web pages and tag
them all with the url of the "best" known example. You would set up
the job like this:
Map Input Key: url
Map Input Value: document
Map Output Key: document checksum, url pagerank
Map Output Value: url
Partitioner: by checksum
OutputKeyComparator: by checksum and then decreasing pagerank
OutputValueGroupingComparator: by checksum
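A rough sketch of those pieces in the mapred API might look like the
following (DupKey is a hypothetical composite key holding the checksum
and pagerank; it is not something Hadoop provides, and each top-level
class would live in its own source file):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Hypothetical composite map output key: document checksum plus pagerank.
  public class DupKey implements WritableComparable<DupKey> {
    private Text checksum = new Text();
    private double pagerank;
    public Text getChecksum() { return checksum; }
    public double getPagerank() { return pagerank; }
    public void write(DataOutput out) throws IOException {
      checksum.write(out);
      out.writeDouble(pagerank);
    }
    public void readFields(DataInput in) throws IOException {
      checksum.readFields(in);
      pagerank = in.readDouble();
    }
    // checksum ascending, then pagerank descending
    public int compareTo(DupKey o) {
      int cmp = checksum.compareTo(o.checksum);
      return cmp != 0 ? cmp : -Double.compare(pagerank, o.pagerank);
    }
  }

  // OutputKeyComparator: checksum, then decreasing pagerank, so the
  // highest-ranked url sorts first within each checksum.
  public class KeyComparator extends WritableComparator {
    public KeyComparator() { super(DupKey.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((DupKey) a).compareTo((DupKey) b);
    }
  }

  // OutputValueGroupingComparator: checksum only, so all keys with the
  // same checksum are fed to a single call to reduce().
  public class GroupingComparator extends WritableComparator {
    public GroupingComparator() { super(DupKey.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((DupKey) a).getChecksum().compareTo(((DupKey) b).getChecksum());
    }
  }

  // Partitioner: checksum only, so every record for a given checksum
  // reaches the same reduce regardless of its pagerank.
  public class ChecksumPartitioner implements Partitioner<DupKey, Text> {
    public void configure(JobConf job) { }
    public int getPartition(DupKey key, Text url, int numPartitions) {
      return (key.getChecksum().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

These are registered on the JobConf with setPartitionerClass,
setOutputKeyComparatorClass, and setOutputValueGroupingComparator, as
above.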
With this setup, the reduce function will be called exactly once per
checksum, and the first value from the iterator will be the url with
the highest pagerank, which can then be used to tag the other entries
of the checksum family.
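On the reduce side, all you have to do is remember that first value; a
sketch (again using the hypothetical DupKey from the example above):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Called once per checksum; the first url out of the iterator is the
  // one with the highest pagerank because of the key sort order.
  public class TagDuplicates extends MapReduceBase
      implements Reducer<DupKey, Text, Text, Text> {
    public void reduce(DupKey key, Iterator<Text> urls,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Text best = new Text(urls.next());           // "best" known example
      while (urls.hasNext()) {
        out.collect(new Text(urls.next()), best);  // tag each duplicate with it
      }
    }
  }

(The copies with new Text(...) matter because the framework reuses the
value object between iterations.)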
-- Owen