In some testing I have been doing with the Bayes classifier on large
data sets, I think I have found a bug in BayesFeatureMapper.
Specifically, an int overflow can easily occur if dKJ is larger than
sqrt(Integer.MAX_VALUE), i.e. above 46340, in this code (starting on line 103):
    public boolean apply(String word, int dKJ) {
      lengthNormalisationMut.add(dKJ * dKJ);
      return true;
    }
I think this happens when the docFreq for a term is high.
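To illustrate the arithmetic, here is a minimal standalone sketch. The class and method names are hypothetical and not part of Mahout; it only assumes that dKJ is an int, as in the method signature above. It compares squaring in int arithmetic with squaring after widening to long.

```java
// Hypothetical demo class, not Mahout code.
class SquareOverflowDemo {

    // Squares in int arithmetic, like dKJ * dKJ in the mapper.
    // Java int multiplication wraps silently on overflow.
    public static int squareAsInt(int dKJ) {
        return dKJ * dKJ;
    }

    // Widens to long before multiplying, so the product is computed in 64 bits.
    public static long squareAsLong(int dKJ) {
        long widened = dKJ;
        return widened * widened;
    }

    public static void main(String[] args) {
        int dKJ = 46341; // smallest positive int whose square exceeds Integer.MAX_VALUE
        System.out.println(squareAsInt(dKJ));  // prints -2147479015
        System.out.println(squareAsLong(dKJ)); // prints 2147488281
    }
}
```

Any dKJ up to 46340 gives the same result either way, so the problem would only show up on inputs with large counts.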
I propose the following simple patch:
Index: core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java
===================================================================
--- core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java	(revision 1136124)
+++ core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java	(working copy)
@@ -101,7 +101,8 @@
     wordList.forEachPair(new ObjectIntProcedure<String>() {
       @Override
       public boolean apply(String word, int dKJ) {
-        lengthNormalisationMut.add(dKJ * dKJ);
+        long dKJ2 = dKJ;
+        lengthNormalisationMut.add(dKJ2 * dKJ2);
         return true;
       }
     });
If this looks good, I can open an Issue and add the patch.
David Croley