I am finding that the multi-phase map-reduce programs I need to write very
often have nearly degenerate map functions in the second and later phases.
The only purpose of these maps is to select the next reduce key, and very
often a local combiner can be used to greatly decrease the number of
records passed to the second reduce.
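
To make that concrete, here is a rough sketch of what I mean by a
degenerate map plus combiner (the function names are made up for
illustration, not anything from Hadoop itself):

    # Second-phase map: nearly degenerate -- its only job is to pick the
    # next reduce key out of what the previous reduce emitted.
    def rekey_map(record):
        next_key, value = record   # record is one output tuple of reduce 1
        yield next_key, value

    # Local combiner: fold records with the same key on each node before
    # the shuffle, e.g. by summing partial counts, so far fewer records
    # reach the second reduce.
    def sum_combiner(key, values):
        yield key, sum(values)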

It isn't hard to implement these programs as multiple fully fledged
map-reduces, but it appears to me that many of them would be better
expressed as something more like a map-reduce-reduce program.

For example, take the problem of co-occurrence counting in log records.  The
first map would extract a user id and an object id and group on user id.
The first reduce would take the entire session for a single user and
generate co-occurrence pairs as keys for the second reduce, each with a
count determined by the frequency of the objects in the user history.  The
second reduce (and a local combiner) would aggregate these counts and
discard pairs with small counts.
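
Concretely, the whole pipeline might look something like this rough Python
sketch (parse(), the freq[a] * freq[b] weighting and the MIN_COUNT
threshold are only placeholders, not part of any real implementation; the
degenerate second-phase map would just forward the pairs, as in the earlier
sketch):

    from collections import Counter
    from itertools import combinations

    def parse(log_record):
        # Placeholder: split a log line into (user_id, object_id).
        user_id, object_id = log_record.split()[:2]
        return user_id, object_id

    # Phase 1 map: pull (user_id, object_id) out of a log record; group on user.
    def map_logs(log_record):
        user_id, object_id = parse(log_record)
        yield user_id, object_id

    # Phase 1 reduce: sees one user's whole session, emits co-occurrence
    # pairs keyed for the second reduce, weighted by object frequency.
    def reduce_session(user_id, object_ids):
        freq = Counter(object_ids)
        for a, b in combinations(sorted(freq), 2):
            yield (a, b), freq[a] * freq[b]

    MIN_COUNT = 5  # illustrative threshold

    # Phase 2 combiner: partial sums on the map side cut the shuffle volume.
    def combine_counts(pair, counts):
        yield pair, sum(counts)

    # Phase 2 reduce: final sums, discarding pairs with small counts.
    def reduce_counts(pair, counts):
        total = sum(counts)
        if total >= MIN_COUNT:
            yield pair, total

All of the real work is in the first map, the first reduce and the second
reduce; the map in between is pure forwarding.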

Expressed conventionally, this would have to write all of the user sessions
to HDFS, and a second map phase would generate the pairs for counting.  The
opportunity for efficiency would come from the ability to avoid writing
intermediate results to the distributed data store.
    
Has anybody looked at whether this would help and whether it would be hard
to do?

