(Related question for committers: who, if anyone, is supporting the
Bayes-related code? I have the impression it's not being looked at,
but I already made that kind of mistake with the k-means code, so I
want to ask!)

Sounds like an easy fix; I can commit it.
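
For anyone following along, here is a minimal Java sketch of the overflow
described in the quoted message below, and of why widening to long before
multiplying avoids it. The class and method names are made up for
illustration and are not Mahout code; the only thing taken from the report
is the dKJ * dKJ expression and the proposed widening to long.

```java
// Illustrative only: SquareOverflowDemo and its helpers are hypothetical,
// not part of Mahout.
class SquareOverflowDemo {

    // Mirrors the reported expression: both operands are int, so the
    // multiplication is done in 32-bit arithmetic and wraps on overflow.
    static int squareAsInt(int dKJ) {
        return dKJ * dKJ;
    }

    // Mirrors the proposed patch: widen to long first, then multiply,
    // so the product is computed in 64-bit arithmetic.
    static long squareAsLong(int dKJ) {
        long widened = dKJ;
        return widened * widened;
    }

    public static void main(String[] args) {
        int largestSafe = 46340;   // floor(sqrt(Integer.MAX_VALUE))
        int firstUnsafe = 46341;

        System.out.println(squareAsInt(largestSafe));   // still fits in an int
        System.out.println(squareAsInt(firstUnsafe));   // wraps to a negative value
        System.out.println(squareAsLong(firstUnsafe));  // correct 64-bit product
    }
}
```

Note that the widening has to happen before the multiplication. Writing
(long) (dKJ * dKJ) would not help, since the overflow would already have
occurred in int arithmetic by the time the cast is applied.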

On Wed, Jun 15, 2011 at 6:29 PM, David Croley <[email protected]> wrote:
> In some testing I have been doing with the Bayes classifier on large
> data sets, I think I have found a bug in the BayesFeatureMapper.
> Specifically, it seems an integer overflow can easily occur if dKJ is larger
> than sqrt(Integer.MAX_VALUE), i.e. above 46340, in this code (starting on
> line 103):
>
>      public boolean apply(String word, int dKJ) {
>        lengthNormalisationMut.add(dKJ * dKJ);
>        return true;
>      }
>
> I think this happens when the docFreq for a term is high.
>
> I propose the following simple patch:
>
> Index: core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java
> ===================================================================
> --- core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (revision 1136124)
> +++ core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureMapper.java (working copy)
> @@ -101,7 +101,8 @@
>     wordList.forEachPair(new ObjectIntProcedure<String>() {
>       @Override
>       public boolean apply(String word, int dKJ) {
> -        lengthNormalisationMut.add(dKJ * dKJ);
> +        long dKJ2 = dKJ;
> +        lengthNormalisationMut.add(dKJ2 * dKJ2);
>         return true;
>       }
>     });
>
> If this looks good, I can open an Issue and add the patch.
>
> David Croley
>
