I would like to compute the spam probability P(word|category) for the
words in the files of one category (bad or good e-mails), as described
below. To compute it in the reduce step, I need the sum of "spamTotal"
across all map tasks. How can I get it?
Map:
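/**
 * Counts word frequency
 */
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, FloatWritable> {
  // spamTotal, count, splitregex and wordregex are fields of this
  // class (declarations not shown here).
  public void map(LongWritable key, Text value,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String[] tokens = line.split(splitregex);
    // For every word token
    for (int i = 0; i < tokens.length; i++) {
      String word = tokens[i].toLowerCase();
      Matcher m = wordregex.matcher(word);
      if (m.matches()) {
        // Each map task only sees its own running total here.
        spamTotal++;
        output.collect(new Text(word), count);
      }
    }
  }
}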
Reduce:
/**
 * Computes bad count / total bad words
 */
public static class Reduce extends MapReduceBase implements
    Reducer<Text, FloatWritable, Text, FloatWritable> {
  public void reduce(Text key, Iterator<FloatWritable> values,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += (int) values.next().get();
    }
    FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
    output.collect(key, badProb);
  }
}
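
To make the goal concrete, here is a minimal single-process sketch
(plain Java, no Hadoop) of the value I am after: the count of each word
divided by the total number of matched words in the category. The class
name SpamProbSketch, the method wordProbabilities and both regular
expressions are placeholders made up for this illustration; they are
not the real splitregex and wordregex. It does not solve the
distributed problem, it only shows the expected result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class SpamProbSketch {
    // Hypothetical stand-ins for splitregex and wordregex (assumptions).
    static final Pattern SPLIT = Pattern.compile("\\W+");
    static final Pattern WORD = Pattern.compile("[a-z]+");

    // Returns count(word) / total matched words for one category.
    static Map<String, Float> wordProbabilities(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        long total = 0; // plays the role of the global spamTotal
        for (String line : lines) {
            for (String token : SPLIT.split(line)) {
                String word = token.toLowerCase();
                if (WORD.matcher(word).matches()) {
                    counts.merge(word, 1, Integer::sum);
                    total++;
                }
            }
        }
        Map<String, Float> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // total is non-zero whenever counts has at least one entry
            probs.put(e.getKey(), (float) e.getValue() / total);
        }
        return probs;
    }

    public static void main(String[] args) {
        List<String> spam = Arrays.asList("Buy cheap pills", "cheap cheap offer");
        // "cheap" occurs 3 times out of 6 matched words
        System.out.println(wordProbabilities(spam).get("cheap")); // prints 0.5
    }
}
```

In the MapReduce version, the "total" variable above is exactly the
number that every reducer would need to know before it can divide.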
--
Best regards, Edward J. Yoon
http://blog.udanax.org