> Suppose a batch of input splits arrives at the beginning, each map
> processes it, and reduce emits the (word, frequency) pairs for this batch.
> Now another batch of input splits arrives, and the results from the
> subsequent reduce are aggregated with the previous results (if the word
> "that" had frequency 2 in the previous run and occurs 1 time in this run,
> its frequency is now maintained as 3).
> If "that" appears 4 times in the next map-reduce, its frequency is
> maintained as 7, and so on.
>
You could merge the result from the previous step in the reducer. If the
number of unique words is not large, the previous step's output can be loaded
into an in-memory hash, which is then used to add each word's count from the
previous step to its count in the current step, as sketched below.
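
Here is a minimal sketch of that reducer. The class name, the "prev.counts.path"
config key, and the assumption that the previous output is tab-separated
"word<TAB>count" text are all mine, not anything fixed by Hadoop; it also
assumes a single reducer (or a previous output partitioned identically),
otherwise the leftover words re-emitted in cleanup() would be duplicated
across partitions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MergingWordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Map<String, Integer> previousCounts = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Hypothetical config key pointing at the previous step's output.
        Path prevOutput = new Path(conf.get("prev.counts.path"));
        FileSystem fs = prevOutput.getFileSystem(conf);
        // Load the whole previous output into memory -- only viable when
        // the set of unique words is small.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(prevOutput)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                previousCounts.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Pull (and consume) the carry-over count from the previous step.
        Integer prev = previousCounts.remove(key.toString());
        if (prev != null) {
            sum += prev;
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Words that appeared only in the previous step must be re-emitted
        // so their counts are not lost from the running total.
        for (Map.Entry<String, Integer> e : previousCounts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}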
In case you expect the unique-word list to be too large to fit in memory, you
could read the previous step's output directly from HDFS; since it is a
sorted file, you can just walk it and merge the counts in a single pass
inside the reduce function.
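
A minimal sketch of that streaming merge follows, under the same assumptions
as above (tab-separated "word<TAB>count" text at a hypothetical
"prev.counts.path", single reducer). It relies on reduce keys arriving in
sorted order, and on that order matching String comparison, which holds for
plain ASCII words.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingMergeReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private BufferedReader previous; // cursor into the sorted previous output
    private String prevWord;         // word currently under the cursor
    private int prevCount;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path prevOutput = new Path(conf.get("prev.counts.path"));
        FileSystem fs = prevOutput.getFileSystem(conf);
        previous = new BufferedReader(new InputStreamReader(fs.open(prevOutput)));
        advance();
    }

    // Move the cursor to the next line of the previous output.
    private void advance() throws IOException {
        String line = previous.readLine();
        if (line == null) {
            prevWord = null;
        } else {
            String[] parts = line.split("\t");
            prevWord = parts[0];
            prevCount = Integer.parseInt(parts[1]);
        }
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        String word = key.toString();
        // Keys reach the reducer in sorted order, so first flush any
        // previous-step words that sort strictly before the current one.
        while (prevWord != null && prevWord.compareTo(word) < 0) {
            context.write(new Text(prevWord), new IntWritable(prevCount));
            advance();
        }
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Merge the carry-over count when the cursor matches this word.
        if (prevWord != null && prevWord.equals(word)) {
            sum += prevCount;
            advance();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the tail of the previous output, then close the reader.
        while (prevWord != null) {
            context.write(new Text(prevWord), new IntWritable(prevCount));
            advance();
        }
        previous.close();
    }
}

Because the previous output is only ever read sequentially, this variant
holds one line in memory at a time, regardless of how many unique words
there are.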

- Sharad
