While testing the Bayes classifier on large data sets, I think I have
found a bug in BayesFeatureMapper. Specifically, an int overflow occurs
whenever dKJ is larger than sqrt(Integer.MAX_VALUE), i.e. greater than
46340, in this code (starting on line 103):

      public boolean apply(String word, int dKJ) {
        lengthNormalisationMut.add(dKJ * dKJ);
        return true;
      }

I think this happens when the docFreq for a term is high.
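
To illustrate the wrap-around, here is a small standalone sketch. The class and method names are made up for this example and are not part of Mahout; it only contrasts int multiplication with widening to long first.

```java
public class SquareOverflowDemo {

    // Mirrors the original expression: both operands are int,
    // so the product is computed in 32 bits and wraps on overflow.
    static int squareAsInt(int dKJ) {
        return dKJ * dKJ;
    }

    // Widens to long before multiplying, so the product is computed in 64 bits.
    static long squareAsLong(int dKJ) {
        long dKJ2 = dKJ;
        return dKJ2 * dKJ2;
    }

    public static void main(String[] args) {
        int safe = 46340;    // largest int whose square still fits in an int
        int tooBig = 46341;  // smallest positive int whose square overflows

        System.out.println(squareAsInt(safe));    // 2147395600
        System.out.println(squareAsInt(tooBig));  // -2147479015 (wrapped)
        System.out.println(squareAsLong(tooBig)); // 2147488281
    }
}
```

The negative value in the second case is what would be added to the length normalisation sum if dKJ were that large.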

I propose the following simple patch:

Index: core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java
===================================================================
--- core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (revision 1136124)
+++ core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (working copy)
@@ -101,7 +101,8 @@
     wordList.forEachPair(new ObjectIntProcedure<String>() {
       @Override
       public boolean apply(String word, int dKJ) {
-        lengthNormalisationMut.add(dKJ * dKJ);
+        long dKJ2 = dKJ;
+        lengthNormalisationMut.add(dKJ2 * dKJ2);
         return true;
       }
     });

If this looks good, I can open an Issue and add the patch.

David Croley
