The Map and Reduce functions of MapReduce
are both defined with respect to data structured in (key, value) pairs.
Map takes one pair of data with a type in one data domain, and
returns a list of pairs in a different domain:
Map(k1, v1) -> list(k2, v2)
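As a minimal sketch, this signature can be written with Python type hints; the type variables K1, V1, K2, V2 below are illustrative placeholders rather than part of any framework's API:

    from typing import Callable, TypeVar

    K1 = TypeVar("K1")
    V1 = TypeVar("V1")
    K2 = TypeVar("K2")
    V2 = TypeVar("V2")

    # Map: one (k1, v1) input pair -> a list of intermediate (k2, v2) pairs
    MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]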
The map function is applied in parallel to every item in the input
dataset. This produces a list of (k2,v2) pairs for each call. After
that, the MapReduce framework collects all pairs with the same key from
all lists and groups them together, creating one group for each
distinct generated key.
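A minimal, single-process sketch of this map-and-group step in Python (a real framework runs the map calls in parallel; the names run_map_and_group, map_fn, and inputs here are illustrative, not any framework's API):

    from collections import defaultdict

    def run_map_and_group(map_fn, inputs):
        # Apply map_fn to every (k1, v1) input pair, then group all the
        # intermediate (k2, v2) pairs by key: one group per distinct k2.
        groups = defaultdict(list)
        for k1, v1 in inputs:
            for k2, v2 in map_fn(k1, v1):
                groups[k2].append(v2)
        return groups  # {k2: [v2, v2, ...], ...}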
The Reduce function is then applied in parallel to each
group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) -> list(v2)
Each Reduce call typically produces either one value v2 or an empty
return, though one call is allowed to return more than one value.
The returns of all calls are collected as the desired result list.
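Continuing the same illustrative sketch, the reduce step applies the user's Reduce function to each group and concatenates whatever values the calls return:

    def run_reduce(reduce_fn, groups):
        # Apply reduce_fn to every (k2, [v2, ...]) group; each call may
        # return zero, one, or several values, all collected in one list.
        results = []
        for k2, values in groups.items():
            results.extend(reduce_fn(k2, values))
        return results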
Thus the MapReduce framework transforms a list of (key, value) pairs
into a list of values. This behavior is different from the functional
programming map and reduce combination, which accepts a list of
arbitrary values and returns one single value that combines all
the values returned by map.
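For contrast, the functional-programming combination can be sketched in Python, where reduce folds the mapped values into one single result:

    from functools import reduce

    # map transforms a list of arbitrary values; reduce then combines
    # all the mapped values into one single value.
    squares = map(lambda x: x * x, [1, 2, 3, 4])  # 1, 4, 9, 16
    total = reduce(lambda a, b: a + b, squares)   # 30, a single combined value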
It is necessary but not sufficient to have implementations of the
map and reduce abstractions in order to implement MapReduce.
Furthermore, effective implementations of MapReduce require a
distributed file system to connect the processes performing the Map and
Reduce phases.
Example
The canonical example application of MapReduce is counting
the appearances of each distinct word in a set of documents:
map(String name, String document):
    // key: document name
    // value: document contents
    for each word w in document:
        EmitIntermediate(w, 1);

reduce(String word, Iterator partialCounts):
    // key: a word
    // values: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(result);
Here, each document is split into words, and each word is counted
with an initial value of "1" by the Map
function, using the word as the result key. The framework brings together
all the pairs with the same key and feeds them to the same call to Reduce,
so that function only needs to sum all of its input values to find
the total appearances of that word.
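The pseudocode above can be turned into a small runnable Python sketch. The sequential mapreduce driver below is only a stand-in for the framework's parallel execution and grouping, and the reduce function emits (word, count) pairs so that the printed output is self-describing, whereas the pseudocode emits only the count and leaves the key implicit:

    from collections import defaultdict

    def map_fn(name, document):
        # key: document name; value: document contents
        return [(word, 1) for word in document.split()]

    def reduce_fn(word, partial_counts):
        # key: a word; values: a list of partial counts for that word
        return [(word, sum(partial_counts))]

    def mapreduce(inputs):
        # Sequential stand-in for the framework: map, group by key, reduce.
        groups = defaultdict(list)
        for name, document in inputs:
            for word, count in map_fn(name, document):
                groups[word].append(count)
        results = []
        for word, counts in groups.items():
            results.extend(reduce_fn(word, counts))
        return results

    print(mapreduce([("doc1", "the quick brown fox"),
                     ("doc2", "the lazy dog")]))
    # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]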