I am not quite sure what you mean by "this".
If you mean that the second approach is only an approximation, then you are
correct: each map only sees its own split, so the counts it uses to decide
what to kill are local rather than global.
The only simple correct algorithm that I know of is to do the counts
(correctly, over all of the data) and then do the main show (processing with
a kill list).
On 4/16/08 9:04 PM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:
> Ted Dunning wrote:
>> The easiest solution is to not worry too much about running an extra MR
>> step.
>>
>> So,
>>
>> - run a first pass to get the counts. Use word count as the pattern. Store
>> the results in a file.
>>
>> - run the second pass. You can now build the hash table by reading the file
>> you stored in pass 1.
>>
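Roughly, the second pass might look like this (old mapred API; this assumes
the pass-1 output is a plain TextOutputFormat file of word<TAB>count lines,
and the class and property names -- FilteringMapper, my.counts.path,
my.cutoff -- are made up for the example):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class FilteringMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private long cutoff;

  // runs once per task: load the pass-1 counts into a hash table
  public void configure(JobConf job) {
    cutoff = job.getLong("my.cutoff", 10);
    try {
      FileSystem fs = FileSystem.get(job);
      Path countsFile = new Path(job.get("my.counts.path"));
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(countsFile)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        counts.put(parts[0], Long.valueOf(parts[1]));
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load counts", e);
    }
  }

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String word = record.toString().trim();
    Long n = counts.get(word);
    if (n != null && n >= cutoff) {       // drop keys below the cutoff
      out.collect(new Text(word), record);
    }
  }
}

If the counts file is big, shipping it to the tasks with DistributedCache
rather than opening it straight off HDFS is the usual refinement.
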
>> Another approach is to do the counting in your maps as specified and then,
>> before each map exits, emit special records for the keys you want to
>> suppress. With the correct sort and partition functions, you can make these
>> kill records appear first in the reduce input. Then, if your reducer sees a
>> kill flag at the front of the values, it can skip the rest of the data for
>> that key.
>>
>>
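Here is a sketch of that kill-record trick (same old API; class names and the
cutoff are invented, and it assumes words contain no tabs). The key is the
word plus a tab and a flag, "0" for kill and "1" for data, so the default
lexicographic sort puts kill records first; the partitioner and grouping
comparator look only at the word, so both record types reach the same
reduce() call. Note the cutoff here is per-map, which is exactly why this is
only an approximation when the cutoff needs the global picture:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.*;

public class KillListExample {
  static final int CUTOFF = 10;   // per-map cutoff, hence approximate

  public static class CountingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Integer> local = new HashMap<String, Integer>();
    private OutputCollector<Text, Text> out;   // saved for close()

    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      this.out = output;
      String word = record.toString().trim();
      Integer n = local.get(word);
      local.put(word, n == null ? 1 : n + 1);
      output.collect(new Text(word + "\t1"), record);   // data record
    }

    // before exiting, emit a kill record for each key below the cutoff
    public void close() throws IOException {
      for (Map.Entry<String, Integer> e : local.entrySet()) {
        if (e.getValue() < CUTOFF) {
          out.collect(new Text(e.getKey() + "\t0"), new Text(""));
        }
      }
    }
  }

  // partition on the word alone so kill and data records for the same
  // word go to the same reducer
  public static class WordPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String word = key.toString().split("\t", 2)[0];
      return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // group on the word alone so one reduce() call sees the kill record
  // (sorted first) followed by the data records
  public static class WordGrouper extends WritableComparator {
    public WordGrouper() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String wa = a.toString().split("\t", 2)[0];
      String wb = b.toString().split("\t", 2)[0];
      return wa.compareTo(wb);
    }
  }

  public static class KillAwareReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // the reduce key is the first key of the group, so its flag tells
      // us whether a kill record led the values
      String[] parts = key.toString().split("\t", 2);
      if (parts[1].equals("0")) {
        return;                       // suppressed key: skip everything
      }
      while (values.hasNext()) {
        output.collect(new Text(parts[0]), values.next());
      }
    }
  }
}

Wire it up with conf.setPartitionerClass(WordPartitioner.class) and
conf.setOutputValueGroupingComparator(WordGrouper.class).
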
> Ted,
> Will this work for the case where the cutoff frequency/count requires a
> global picture? I guess not.