Completely agree. We are seeing the same pattern - most of what we do ends up needing a series of map-reduce jobs. There are a few different alternatives that may help:
1. The output of the intermediate reduce phases can be written to files that are not replicated. I am not sure whether we can control this through map-reduce itself, but HDFS does seem to let you set the replication level per file.

2. Map tasks of the next step are streamed data directly from the preceding reduce tasks. This is more along the lines Ted is suggesting - make iterative map-reduce a primitive natively supported in Hadoop. This is probably the better solution, but more work?

I am sure this has been encountered in other scenarios (heck - I am just a month into using Hadoop), so I would be interested to know what other people are thinking and whether there are any upcoming features to support this programming paradigm. (I have appended a couple of rough sketches below the quoted message: one for the per-file replication idea in 1, and one showing how degenerate the second-phase map ends up being in Ted's co-occurrence example.)

Joydeep

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 22, 2007 10:56 AM
To: [email protected]
Subject: Poly-reduce?

I am finding it a common pattern that the multi-phase map-reduce programs I need to write very often have nearly degenerate map functions in the second and later map-reduce phases. The only purpose of these maps is to select the next reduce key, and very often a local combiner can be used to greatly decrease the number of records passed to the second reduce.

It isn't hard to implement these programs as multiple fully fledged map-reduces, but it appears to me that many of them would be better expressed as something more like a map-reduce-reduce program.

For example, take the problem of co-occurrence counting in log records. The first map would extract a user id and an object id and group on user id. The first reduce would take the entire session history for a single user and generate co-occurrence pairs as keys for the second reduce, each with a count determined by the frequency of the objects in the user history. The second reduce (and local combiner) would aggregate these counts and discard items with small counts.

Expressed conventionally, this would have to write all of the user sessions to HDFS, and a second map phase would generate the pairs for counting. The opportunity for efficiency would come from the ability to avoid writing intermediate results to the distributed data store.

Has anybody looked at whether this would help and whether it would be hard to do?
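
[Appended by Joydeep] A rough, untested sketch of the per-file replication idea in 1, against the old org.apache.hadoop.mapred API. The class name and path are made up, the actual mapper/reducer/path setup is elided, and the exact calls vary between Hadoop versions: the idea is just to set dfs.replication to 1 in the intermediate job's configuration so the extra copies never get written, or to lower it after the fact with FileSystem.setReplication.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IntermediatePhase {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IntermediatePhase.class);
    conf.setJobName("cooccurrence-phase-1");
    // ... mapper, reducer, input and output path setup elided ...

    // Ask the DFS client used by this job for a single replica of
    // anything the job writes, including the reduce output files.
    conf.setInt("dfs.replication", 1);
    JobClient.runJob(conf);

    // Alternatively, lower the replication of a file that already exists.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/tmp/phase1/part-00000"), (short) 1);  // hypothetical path
  }
}

This only cuts the replication cost, of course - the intermediate data still makes a round trip through HDFS, which is what alternative 2 (and Ted's proposal) would avoid entirely.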

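And to make the "nearly degenerate map" point concrete, here is roughly what phase 2 of the co-occurrence job boils down to when written as a separate full job, using the classic org.apache.hadoop.mapred interfaces. All class names, the assumed phase-1 output format of "pairKey<TAB>count", and the cutoff are made up for illustration, and signatures differ a bit across versions:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PairCountPhase {

  // Phase-2 map: essentially an identity. Its only job is to pick the next
  // reduce key, i.e. the co-occurrence pair the phase-1 reduce already wrote.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\t");
      out.collect(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
    }
  }

  // Local combiner: just sums partial counts. No cutoff here, since a
  // combiner only ever sees part of the data for a key.
  public static class SumCombiner extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text pair, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(pair, new LongWritable(sum));
    }
  }

  // Phase-2 reduce: sum the counts and discard the rare pairs.
  public static class PairReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    private static final long MIN_COUNT = 5;  // arbitrary cutoff for illustration
    public void reduce(Text pair, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      if (sum >= MIN_COUNT) {
        out.collect(pair, new LongWritable(sum));
      }
    }
  }
}

The map here does nothing except re-key records the previous reduce already produced, which is exactly the part a map-reduce-reduce primitive would let us drop, along with the HDFS write and re-read in between.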