(Related question for committers: who, if anyone, is supporting the Bayes-related code? My impression is that nobody is looking at it, but I already made that kind of mistake with the k-means code, so I want to ask!)
Sounds like an easy fix; I can commit it.

On Wed, Jun 15, 2011 at 6:29 PM, David Croley <[email protected]> wrote:
> In some testing I have been doing with the Bayes classifier on large
> data sets, I think I have found a bug in the BayesFeatureMapper.
> Specifically, it seems an integer overflow can easily occur if dKJ is
> larger than Sqrt(Integer.MAX_VALUE) in this code (starting on line 103):
>
>   public boolean apply(String word, int dKJ) {
>     lengthNormalisationMut.add(dKJ * dKJ);
>     return true;
>   }
>
> I think this happens when the docFreq for a term is high.
>
> I propose the following simple patch:
>
> Index: core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java
> ===================================================================
> --- core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (revision 1136124)
> +++ core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (working copy)
> @@ -101,7 +101,8 @@
>      wordList.forEachPair(new ObjectIntProcedure<String>() {
>        @Override
>        public boolean apply(String word, int dKJ) {
> -        lengthNormalisationMut.add(dKJ * dKJ);
> +        long dKJ2 = dKJ;
> +        lengthNormalisationMut.add(dKJ2 * dKJ2);
>          return true;
>        }
>      });
>
> If this looks good, I can open an Issue and add the patch.
>
> David Croley
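
For anyone who wants to see the overflow in isolation, here is a minimal standalone sketch. It does not use any Mahout classes; the class name SquareOverflowDemo and its two helper methods are made up for illustration. It only shows that squaring in int arithmetic wraps around once dKJ exceeds 46340, while widening to long first (as the patch does) gives the expected value.

```java
public class SquareOverflowDemo {

    // Squares in int arithmetic, like the original line.
    // The multiplication wraps around silently if the result exceeds Integer.MAX_VALUE.
    static int squareAsInt(int dKJ) {
        return dKJ * dKJ;
    }

    // Widens to long before multiplying, like the proposed patch.
    static long squareAsLong(int dKJ) {
        long dKJ2 = dKJ;
        return dKJ2 * dKJ2;
    }

    public static void main(String[] args) {
        int stillFits = 46340; // floor(sqrt(Integer.MAX_VALUE)); its square fits in an int
        int overflows = 46341; // one more, and the int product wraps to a negative number

        System.out.println(stillFits + " squared as int:  " + squareAsInt(stillFits));
        System.out.println(overflows + " squared as int:  " + squareAsInt(overflows));
        System.out.println(overflows + " squared as long: " + squareAsLong(overflows));
    }
}
```

Writing `(long) dKJ * dKJ` would have the same effect as the extra local variable, since the cast applies to the first operand before the multiplication. Either way, the fix relies on the accumulator's add method accepting a long (or a type a long widens to), which I have not checked here.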
