Ted Dunning wrote:
I am not quite sure what you mean by "this".
If you mean that the second approach is only an approximation, then yes, you
are correct.
The only simple correct algorithm that I know of is to do the counts
(correctly) and then do the main show (processing with a kill list).
On 4/16/08 9:04 PM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:
Ted Dunning wrote:
The easiest solution is not to worry too much about running an extra MR
step.
So,
- run a first pass to get the counts, using word count as the pattern, and
store the results in a file.
- run the second pass; its mappers can now read the counts into a hash table
from the file you stored in pass 1 (a rough sketch follows below).
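For concreteness, here is a rough, untested sketch of the pass-2 mapper
against the old mapred API. The word<TAB>count file format, the "counts.file"
property, and the cutoff are all made-up stand-ins for whatever your pass 1
actually wrote:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Pass-2 mapper: load the pass-1 counts and drop words at/above a cutoff.
public class FilterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final long CUTOFF = 100;          // made-up threshold
  private final Set<String> kill = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      // "counts.file" is a made-up property naming the pass-1 output.
      Path counts = new Path(job.get("counts.file"));
      FileSystem fs = counts.getFileSystem(job);
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(counts)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");         // word<TAB>count
        if (Long.parseLong(parts[1]) >= CUTOFF) {
          kill.add(parts[0]);                      // word to suppress
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not read counts file", e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    for (String word : line.toString().split("\\s+")) {
      if (word.length() > 0 && !kill.contains(word)) {
        out.collect(new Text(word), new LongWritable(1));
      }
    }
  }
}

Loading only the suppressed words, rather than all counts, keeps the hash
table small.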
Another approach is to do the counting in your maps as specified and then,
before exiting, emit special records for each key to suppress. With the
correct sort and partition functions, you can make these killer records
appear first in the reduce input. Then, if your reducer sees the kill flag
at the front of the values, it can skip the rest of the data for that key
(see the sketch below).
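Roughly like this, again an untested sketch against the old mapred API.
Suppose each map emits its combined counts under keys of the form "word\t1"
and, before exiting, a kill record "word\t0" for any word whose local count
crosses the threshold; the key encoding and all class names here are
invented. Partitioning and grouping on the word alone, with the default Text
sort, puts the kill record first in each group:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KillListExample {

  // Partition on the word alone so a word's kill record and its data
  // records land in the same reduce.
  public static class KillPartitioner
      implements Partitioner<Text, LongWritable> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, LongWritable value, int numPartitions) {
      String word = key.toString().split("\t")[0];
      return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the word alone so kill and data records share one reduce()
  // call; the default Text sort already puts "word\t0" ahead of "word\t1".
  public static class WordGroupComparator extends WritableComparator {
    public WordGroupComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().split("\t")[0]
          .compareTo(b.toString().split("\t")[0]);
    }
  }

  public static class KillAwareReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out,
                       Reporter reporter) throws IOException {
      // The kill record sorts first, so the group's key carries the flag.
      if (key.toString().endsWith("\t0")) {
        return;                                  // suppressed: skip it all
      }
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(new Text(key.toString().split("\t")[0]),
                  new LongWritable(sum));
    }
  }
}

// In the job setup:
//   conf.setPartitionerClass(KillListExample.KillPartitioner.class);
//   conf.setOutputValueGroupingComparator(
//       KillListExample.WordGroupComparator.class);

Since each map decides to emit a kill record from its own local counts, this
only approximates a global cutoff, which is exactly the limitation discussed
above.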
Ted,
Will this work for the case where the cutoff frequency/count requires a
global picture? I guess not.