Hello,
Great observation; here's a hack that may be helpful
until such functionality is built into Hadoop.
You can declare static Java collections inside your class that
implements Reducer, but outside of your reduce method. TreeMaps work
well for this, or a HashMap guarded by an if (!map.containsKey(...))
check. Iterate through your intermediate values in the while loop,
placing items in these inner dictionary structures. After the while
loop (still inside reduce), read the collected data back out and send
it to the OutputCollector, with the session id as the
WritableComparable key and a Text value.
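
A rough sketch of that idea with the old org.apache.hadoop.mapred API
(class and field names here are just illustrative, and I use an
instance field rather than a static one, which behaves the same way
within a single reduce task):

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SessionReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Lives in the class, outside reduce(), so it can accumulate items
  // while the values iterator is consumed; cleared on each call.
  private final TreeMap<String, Integer> counts =
      new TreeMap<String, Integer>();

  public void reduce(Text sessionId, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    counts.clear();
    // Iterate through the intermediate values, filling the dictionary.
    while (values.hasNext()) {
      String item = values.next().toString();
      Integer n = counts.get(item);
      counts.put(item, n == null ? 1 : n + 1);
    }
    // After the while loop, emit the collected data with the session
    // id as the (WritableComparable) key and a Text value.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      output.collect(sessionId, new Text(e.getKey() + "\t" + e.getValue()));
    }
  }
}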
Regards,
Peter W.
On Aug 22, 2007, at 10:55 AM, Ted Dunning wrote:
I am finding a common pattern: the multi-phase map-reduce programs I
need to write very often have nearly degenerate map functions in the
second and later map-reduce phases. The only purpose of these
functions is to select the next reduce key, and very often a local
combiner can be used to greatly decrease the number of records passed
to the second reduce.
It isn't hard to implement these programs as multiple fully fledged
map-reduces, but it appears to me that many of them would be better
expressed as something more like a map-reduce-reduce program.
For example, take the problem of co-occurrence counting in log
records. The first map would extract a user id and an object id and
group on user id. The first reduce would take entire sessions for a
single user and generate co-occurrence pairs as keys for the second
reduce, each with a count determined by the frequency of the objects
in the user history. The second reduce (and local combiner) would
aggregate these counts and discard items with small counts.
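
(A rough sketch of that first, pair-generating reduce in the old
org.apache.hadoop.mapred API; the names are made up, and the
frequency weighting is simplified to one count per observed pair,
which the combiner / second reduce would then sum.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PairGeneratingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void reduce(Text userId, Iterator<Text> objectIds,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // Collect one user's whole session.
    List<String> session = new ArrayList<String>();
    while (objectIds.hasNext()) {
      session.add(objectIds.next().toString());
    }
    // Emit every unordered pair of objects seen in the same session
    // as a key for the second reduce to aggregate.
    for (int i = 0; i < session.size(); i++) {
      for (int j = i + 1; j < session.size(); j++) {
        output.collect(new Text(session.get(i) + "," + session.get(j)), ONE);
      }
    }
  }
}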
Expressed conventionally, this would require writing all of the user
sessions to HDFS, and a second map phase would generate the pairs for
counting. The opportunity for efficiency would come from the ability
to avoid writing intermediate results to the distributed data store.
Has anybody looked at whether this would help and whether it would
be hard
to do?