It would be reasonable and pretty easy to add an additional method to
LogLikelihood that accepts longs.

Peter, would you like to produce a simple patch?
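
For reference, a rough sketch of what such an overload might look like.
It mirrors the logLikelihoodRatio body quoted below, with long parameters.
The class name LongLogLikelihood and the entropy helper are assumptions
made for illustration (the existing entropy code is not shown in this
thread), so treat it as a starting point rather than the actual patch.

```java
// Hypothetical stand-alone class; a real patch would add the overload to
// org.apache.mahout.math.stats.LogLikelihood instead.
public final class LongLogLikelihood {
  private LongLogLikelihood() {}

  // Assumed definition: unnormalized Shannon entropy of raw counts,
  // -sum(x * ln(x / total)), where zero counts contribute nothing.
  static double entropy(long... counts) {
    long total = 0;
    for (long c : counts) {
      if (c < 0) {
        throw new IllegalArgumentException("negative count: " + c);
      }
      total += c;
    }
    double result = 0.0;
    for (long c : counts) {
      if (c > 0) {
        result += c * Math.log((double) c / total);
      }
    }
    return -result;
  }

  // Same structure as the int version quoted in the issue, but the counts
  // stay as longs all the way down, so no narrowing cast is needed.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
    double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy > matrixEntropy) {
      // round off error
      return 0.0;
    }
    return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
  }
}
```

With something like this in place, LLRReducer could presumably declare
k11 through k22 as long and drop the (int) cast on k22.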

On Tue, Jun 21, 2011 at 8:09 AM, peter andrews (JIRA) <[email protected]> wrote:

> Collocation driver has long being statically cast to an int
> -----------------------------------------------------------
>
>                 Key: MAHOUT-738
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-738
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: peter andrews
>            Priority: Minor
>
>
> org.apache.mahout.vectorizer.collocations.llr.LLRReducer, which is part of
> the collocation driver, statically casts a long to an int.
>
> private long ngramTotal;
> ...
> int k11 = ngram.getFrequency(); /* a&b */
> int k12 = gramFreq[0] - ngram.getFrequency(); /* a&!b */
> int k21 = gramFreq[1] - ngram.getFrequency(); /* !b&a */
> int k22 = (int) (ngramTotal - (gramFreq[0] + gramFreq[1] -
> ngram.getFrequency())); /* !a&!b */
>
> These numbers are then fed into
>
> org.apache.mahout.math.stats.LogLikelihood
>
> specifically the function below.
>
> public static double logLikelihoodRatio(int k11, int k12, int k21, int k22)
> {
>  // note that we have counts here, not probabilities, and that the entropy
>  // is not normalized.
>  double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>  double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>  double matrixEntropy = entropy(k11, k12, k21, k22);
>  if (rowEntropy + columnEntropy > matrixEntropy) {
>    // round off error
>    return 0.0;
>  }
>  return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
> }
>
> In short if the long ngramTotal is larger than Integer.MAX_VALUE (which
> will happen in large datasets), then the driver will either crash or in the
> case that it casts to a negative int, will continue as usual but produce no
> output due to error checking.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
