This is a well-known problem. Basically, you want to aggregate values
computed at some previous step.

Emit <category, probability> pairs and have the reducer simply sum up
the probabilities for a given category.

(It is the same task as summing up the word counts.)
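
To make the idea concrete, here is a rough sketch in plain Java. It does not
use the Hadoop API; it only imitates the map and reduce steps in memory. The
class name, the method names and the "\\W+" tokenizer are all made up for
illustration and are not taken from your job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map and reduce steps (not Hadoop code).
class CategorySumSketch {

    // "Map" step: emit one (category, 1) pair for every word in the line.
    // The "\\W+" split is an assumption standing in for the real tokenizer.
    static List<Map.Entry<String, Float>> map(String category, String line) {
        List<Map.Entry<String, Float>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the line starts with a separator
            }
            out.add(new AbstractMap.SimpleEntry<>(category, 1f));
        }
        return out;
    }

    // "Reduce" step: sum all values that share a key, exactly as in word count.
    static Map<String, Float> reduce(List<Map.Entry<String, Float>> pairs) {
        Map<String, Float> sums = new TreeMap<>();
        for (Map.Entry<String, Float> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Float::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Each map() call plays the role of a separate map task.
        List<Map.Entry<String, Float>> pairs = new ArrayList<>();
        pairs.addAll(map("spam", "Buy cheap pills"));
        pairs.addAll(map("spam", "cheap offer"));
        pairs.addAll(map("good", "see you at lunch"));

        // The per-category sums are the totals aggregated across all "map tasks".
        System.out.println(reduce(pairs));
    }
}
```

The sum under the "spam" key plays the role of spamTotal: no single map task
knows it, but the reducer sees every pair for that key and can add them up.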

Miles

2008/10/7 Edward J. Yoon <[EMAIL PROTECTED]>:
> I would like to get the spam probability P(word|category) of the words
> from the files of a category (bad/good e-mails), as described below. BTW,
> to compute it in reduce, I need the sum of "spamTotal" across map
> tasks. How can I get it?
>
> Map:
>
>    /**
>     * Counts word frequency
>     */
>    public void map(LongWritable key, Text value,
>        OutputCollector<Text, FloatWritable> output, Reporter reporter)
>        throws IOException {
>      String line = value.toString();
>      String[] tokens = line.split(splitregex);
>
>      // For every word token
>      for (int i = 0; i < tokens.length; i++) {
>        String word = tokens[i].toLowerCase();
>        Matcher m = wordregex.matcher(word);
>        if (m.matches()) {
>          spamTotal++;
>          output.collect(new Text(word), count);
>        }
>      }
>    }
>  }
>
> Reduce:
>
>  /**
>   * Computes bad count / total bad words
>   */
>  public static class Reduce extends MapReduceBase implements
>      Reducer<Text, FloatWritable, Text, FloatWritable> {
>
>    public void reduce(Text key, Iterator<FloatWritable> values,
>        OutputCollector<Text, FloatWritable> output, Reporter reporter)
>        throws IOException {
>      int sum = 0;
>      while (values.hasNext()) {
>        sum += (int) values.next().get();
>      }
>
>      FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
>      output.collect(key, badProb);
>    }
>  }
>
>
> --
> Best regards, Edward J. Yoon
> [EMAIL PROTECTED]
> http://blog.udanax.org
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
