It would be reasonable and pretty easy to add an additional method to LogLikelihood that accepts longs.
Peter, would you like to produce a simple patch? On Tue, Jun 21, 2011 at 8:09 AM, peter andrews (JIRA) <[email protected]>wrote: > Collocation driver has long being statically cast to an int > ----------------------------------------------------------- > > Key: MAHOUT-738 > URL: https://issues.apache.org/jira/browse/MAHOUT-738 > Project: Mahout > Issue Type: Bug > Components: Math > Affects Versions: 0.5 > Reporter: peter andrews > Priority: Minor > > > org.apache.mahout.vectorizer.collocations.llr.LLRReducer, which is part of > the collocation driver, statically casts a long to an int. > > private long ngramTotal; > ... > int k11 = ngram.getFrequency(); /* a&b */ > int k12 = gramFreq[0] - ngram.getFrequency(); /* a&!b */ > int k21 = gramFreq[1] - ngram.getFrequency(); /* !b&a */ > int k22 = (int) (ngramTotal - (gramFreq[0] + gramFreq[1] - > ngram.getFrequency())); /* !a&!b */ > > These numbers are then fed into > > org.apache.mahout.math.stats.LogLikelihood > > specifically the function below. > > public static double logLikelihoodRatio(int k11, int k12, int k21, int k22) > { > // note that we have counts here, not probabilities, and that the entropy > is not normalized. > double rowEntropy = entropy(k11, k12) + entropy(k21, k22); > double columnEntropy = entropy(k11, k21) + entropy(k12, k22); > double matrixEntropy = entropy(k11, k12, k21, k22); > if (rowEntropy + columnEntropy > matrixEntropy) { > // round off error > return 0.0; > } > return 2.0 * (matrixEntropy - rowEntropy - columnEntropy); > } > > In short if the long ngramTotal is larger than Integer.MAX_VALUE (which > will happen in large datasets), then the driver will either crash or in the > case that it casts to a negative int, will continue as usual but produce no > output due to error checking. > > -- > This message is automatically generated by JIRA. > For more information on JIRA, see: http://www.atlassian.com/software/jira > > >
