> Suppose a batch of input splits arrives at the beginning to every map, and
> reduce gives the (word, frequency) pairs for this batch of input splits.
> Now after this another batch of input splits arrives, and the results from
> the subsequent reduce are aggregated with the previous results (if the word
> "that" had frequency 2 in the previous processing and occurs 1 time in this
> processing, then the frequency of "that" is now maintained as 3).
> If "that" comes 4 times in the next map-reduce, its frequency is now
> maintained as 7....

You could merge the result from the previous step in the reducer. If the number of unique words is not large, the output from the previous step can be loaded into an in-memory hash, which can then be used to add the previous step's count to the current step's count (first sketch below). If you expect the unique word list to be too large to fit in memory, you could read the previous step's output directly from HDFS; since it is a sorted file, you can just walk it and merge the counts in a single pass in the reduce function (second sketch below).
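Here is a minimal sketch of the in-memory hash approach, assuming the previous job wrote plain-text "word<TAB>count" part files; the class name and the "prev.output.path" configuration key are placeholders of my own, not anything standard.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MergingWordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final Map<String, Integer> prevCounts = new HashMap<String, Integer>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // "prev.output.path" is a hypothetical config key pointing at the
    // previous job's output directory.
    Path prevOutput = new Path(conf.get("prev.output.path"));
    // Load every "word<TAB>count" line from the previous run's part files.
    for (FileStatus status : prevOutput.getFileSystem(conf).listStatus(prevOutput)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS, _logs etc.
      }
      BufferedReader in = new BufferedReader(new InputStreamReader(
          status.getPath().getFileSystem(conf).open(status.getPath())));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          String[] fields = line.split("\t");
          prevCounts.put(fields[0], Integer.valueOf(fields[1]));
        }
      } finally {
        in.close();
      }
    }
  }

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    // Add the frequency carried over from the previous run, if any.
    Integer previous = prevCounts.get(word.toString());
    if (previous != null) {
      sum += previous.intValue();
    }
    context.write(word, new IntWritable(sum));
  }
}

One caveat with this sketch: a word that appeared only in the previous output and not in the current batch is never passed to reduce(), so it would drop out. One way around that is to have each reducer load only its own matching part file (valid as long as the partitioner and number of reducers stay the same) and flush the unmerged entries in cleanup().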
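And a sketch of the single-pass merge for the large case, assuming each reducer is pointed (via a hypothetical "prev.part.path" key) at the previous run's part file that holds its share of the keys, i.e. the partitioner and reducer count have not changed. It also assumes ASCII words, where String.compareTo agrees with the sort order of the reduce keys.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortedMergeReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private BufferedReader prev;
  private String prevWord; // next unconsumed word from the previous output
  private int prevCount;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path prevPart = new Path(conf.get("prev.part.path"));
    prev = new BufferedReader(new InputStreamReader(
        prevPart.getFileSystem(conf).open(prevPart)));
    advance();
  }

  // Read the next "word<TAB>count" line, or mark end of file.
  private void advance() throws IOException {
    String line = prev.readLine();
    if (line == null) {
      prevWord = null;
      return;
    }
    String[] fields = line.split("\t");
    prevWord = fields[0];
    prevCount = Integer.parseInt(fields[1]);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    String word = key.toString();
    // Pass through previous-run words that sort before the current key.
    while (prevWord != null && prevWord.compareTo(word) < 0) {
      context.write(new Text(prevWord), new IntWritable(prevCount));
      advance();
    }
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    // Merge the old count for the current word, if present.
    if (prevWord != null && prevWord.equals(word)) {
      sum += prevCount;
      advance();
    }
    context.write(key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Flush previous-run words that never appeared in the current batch.
    while (prevWord != null) {
      context.write(new Text(prevWord), new IntWritable(prevCount));
      advance();
    }
    prev.close();
  }
}

Because the previous file and the incoming reduce keys are both sorted, this never buffers more than one previous line at a time, and the cleanup() loop also handles words that only existed in the earlier runs.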
- Sharad