This is a well-known problem. Basically, you want to aggregate values
computed at some previous step.
-- Emit <category, probability> pairs and have the reducer simply sum up
the probabilities for a given category
(it is the same task as summing up the word counts).
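
A minimal sketch of that idea, written as plain Java that only simulates the
map and reduce steps (no Hadoop classes are used). The class name
SpamTotalSketch, the reserved key "*TOTAL*" and the whitespace / [a-z]+
tokenisation are assumptions made for illustration; they are not from your
code. The point is that the total is emitted as an ordinary key and summed by
the reducer, rather than kept in a field that each map task would only see
its own copy of.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map and reduce steps; not the Hadoop API.
class SpamTotalSketch {
    // Hypothetical reserved key. It cannot collide with a real word here
    // because words must match [a-z]+ below.
    static final String TOTAL_KEY = "*TOTAL*";

    // "Map": emit <word, 1> for each word, plus <TOTAL_KEY, 1>, so the
    // total is aggregated like any other key instead of in a shared field.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            String word = token.toLowerCase(Locale.ROOT);
            if (word.matches("[a-z]+")) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
                out.add(new AbstractMap.SimpleEntry<>(TOTAL_KEY, 1));
            }
        }
        return out;
    }

    // "Reduce": sum the values for each key, exactly as in word count.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Follow-up step: divide each word count by the aggregated total.
    static Map<String, Double> probabilities(Map<String, Integer> sums) {
        int total = sums.getOrDefault(TOTAL_KEY, 0);
        Map<String, Double> probs = new TreeMap<>();
        if (total == 0) {
            return probs; // nothing was counted, so there is nothing to divide
        }
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            if (!e.getKey().equals(TOTAL_KEY)) {
                probs.put(e.getKey(), e.getValue() / (double) total);
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Two lines stand in for two independent map tasks.
        pairs.addAll(map("buy cheap pills"));
        pairs.addAll(map("cheap cheap offer"));
        System.out.println(probabilities(reduce(pairs)));
    }
}
```

Note that the division happens only after all the sums are known, so in a
real job it would probably have to be a separate pass over the reducer
output (or a second job) rather than part of the same reduce call.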
Miles
2008/10/7 Edward J. Yoon <[EMAIL PROTECTED]>:
> I would like to get the spam probability P(word|category) of the words
> from the files of a category (bad/good e-mails), as described below. BTW,
> to compute it in the reducer, I need the sum of "spamTotal" across all map
> tasks. How can I get it?
>
> Map:
>
>   /**
>    * Counts word frequency
>    */
>   public void map(LongWritable key, Text value,
>       OutputCollector<Text, FloatWritable> output, Reporter reporter)
>       throws IOException {
>     String line = value.toString();
>     String[] tokens = line.split(splitregex);
>
>     // For every word token
>     for (int i = 0; i < tokens.length; i++) {
>       String word = tokens[i].toLowerCase();
>       Matcher m = wordregex.matcher(word);
>       if (m.matches()) {
>         spamTotal++;
>         output.collect(new Text(word), count);
>       }
>     }
>   }
> }
>
> Reduce:
>
> /**
>  * Computes bad count / total bad words
>  */
> public static class Reduce extends MapReduceBase implements
>     Reducer<Text, FloatWritable, Text, FloatWritable> {
>
>   public void reduce(Text key, Iterator<FloatWritable> values,
>       OutputCollector<Text, FloatWritable> output, Reporter reporter)
>       throws IOException {
>     int sum = 0;
>     while (values.hasNext()) {
>       sum += (int) values.next().get();
>     }
>
>     FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
>     output.collect(key, badProb);
>   }
> }
>
>
> --
> Best regards, Edward J. Yoon
> [EMAIL PROTECTED]
> http://blog.udanax.org
>