Hello,

Great observation; here's a hack that may be helpful until such
functionality is added to Hadoop.

You can work around this by putting static Java collections inside your
class that implements Reducer, but outside of your reduce method.

TreeMaps are good for this, or a HashMap guarded by
if (!map.containsKey(key)) checks.

Iterate through your intermediate values in the while loop, placing
items in these inner dictionary structures. After the while loop (still
inside reduce), read the collected data back out and send it to the
OutputCollector with the session id as the WritableComparable key and a
Text value.
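
Roughly something like the following, as a minimal sketch against the
old org.apache.hadoop.mapred API. The class name, the use of Text for
both keys and values, the per-session item counting, and the
tab-separated output format are just my assumptions:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SessionCountReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Static collection on the class, outside of reduce(); cleared at
  // the start of each call so it only holds the current session.
  private static final TreeMap<String, Long> counts =
      new TreeMap<String, Long>();

  public void reduce(Text sessionId, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    counts.clear();

    // Iterate through the intermediate values in the while loop,
    // placing items in the inner dictionary structure.
    while (values.hasNext()) {
      String item = values.next().toString();
      Long n = counts.get(item);
      counts.put(item, n == null ? 1L : n + 1L);
    }

    // Still inside reduce(): read the collected data back out and emit
    // it, with the session id as the WritableComparable key and a Text
    // value (here "item<TAB>count", purely as an example format).
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      output.collect(sessionId, new Text(e.getKey() + "\t" + e.getValue()));
    }
  }
}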

Regards,

Peter W.

On Aug 22, 2007, at 10:55 AM, Ted Dunning wrote:


I am finding it a common pattern that the multi-phase map-reduce
programs I need to write have nearly degenerate map functions in second
and later map-reduce phases. The only need for these functions is to
select the next reduce key, and very often a local combiner can be used
to greatly decrease the number of records passed to the second reduce.

It isn't hard to implement these programs as multiple fully fledged
map-reduces, but it appears to me that many of them would be better
expressed as something more like a map-reduce-reduce program.

For example, take the problem of co-occurrence counting in log records.
The first map would extract a user id and an object id and group on
user id. The first reduce would take entire sessions for a single user
and generate co-occurrence pairs as keys for the second reduce, each
with a count determined by the frequency of the objects in the user
history. The second reduce (and local combiner) would aggregate these
counts and discard items with small counts.
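
(A rough sketch of what that first reduce could look like, again using
the old org.apache.hadoop.mapred API; the class name, the Text object
ids, the tab-separated pair keys, and emitting a count of 1 per
observed pair are all assumed details for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// First reduce: one call per user id, values are the object ids in that
// user's history; emit each co-occurring pair as a key with a count.
public class PairGeneratingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  public void reduce(Text userId, Iterator<Text> objectIds,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    // Buffer the session; the values iterator can only be read once.
    List<String> session = new ArrayList<String>();
    while (objectIds.hasNext()) {
      session.add(objectIds.next().toString());
    }

    // Emit every ordered pair as a key for the second reduce, which
    // (with a local combiner) sums the counts and drops small ones.
    for (int i = 0; i < session.size(); i++) {
      for (int j = 0; j < session.size(); j++) {
        if (i != j) {
          output.collect(new Text(session.get(i) + "\t" + session.get(j)),
                         ONE);
        }
      }
    }
  }
})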

Expressed conventionally, this would have to write all of the user
sessions to HDFS, and a second map phase would generate the pairs for
counting. The opportunity for efficiency would come from the ability to
avoid writing intermediate results to the distributed data store.

Has anybody looked at whether this would help and whether it would be hard
to do?


