On Aug 22, 2007, at 10:55 AM, Ted Dunning wrote:
I am finding it a common pattern that the multi-phase map-reduce
programs I need to write very often have nearly degenerate map
functions in the second and later map-reduce phases. The only purpose
of these map functions is to select the next reduce key, and very
often a local combiner can be used to greatly decrease the number of
records passed to the second reduce.
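[A minimal sketch of the pattern Ted describes, not code from the thread: a second-phase mapper whose only job is to re-key the first phase's output, plus a combiner that collapses counts locally. The class names and the tab-separated "key\tcount" input format are assumptions for illustration, using the org.apache.hadoop.mapred API.]

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Nearly degenerate second-phase mapper: it only selects the next reduce key
// from a first-phase output line of the (assumed) form "key\tcount".
public class RekeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");
    // Re-key on the first field; the value is passed through unchanged.
    output.collect(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
  }
}

// Local combiner: sums counts on the map side so far fewer records are
// shuffled to the second reduce.
class SumCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> counts,
                     OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}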
My opinion is that handling these kinds of patterns in the framework
itself is a mistake. It would introduce a lot of complexity, and the
payback to applications would be relatively slight. I'd much rather
have the Hadoop framework support the single primitive (map/reduce)
very well and build a layer on top that provides a very general
algebra over map/reduce operations. One early example of this is Pig
(http://research.yahoo.com/project/pig).
-- Owen