Hi,

Because of a memory problem when running the NSP package on my trigrams, I decided to program the calculations myself. Not being a Perl programmer, I can only hope that I have understood the code correctly. I started from a .cnt file. Example lines (the first line is the total n-gram count, followed by one example line):

    355663266
    at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083

I based my work on what I found in the file 3D.pm (concerning expected and observed frequencies) and translated it into Lisp code. Then I used the specific code for each association measure.

Having done the programming, I tested my code for computing scores on a small data sample of 20 lines, and ran the NSP package on the same sample (NSP does not crash on this small sample). It turns out that my values are far from the ones produced by the NSP package, and I see no reason why. Could anyone have a look at this?

Specifically, say that I wish to compute the pmi for the example trigram above. According to the file pmi.pm:

"The expected values for the internal cells are calculated by taking the product of their associated marginals and dividing by the sample size, for example:

           n1pp * np1p * npp1
    m111 = ------------------
                  nppp

Pointwise Mutual Information (pmi) is defined as the log of the deviation between the observed frequency of a trigram (n111) and the probability of that trigram if it were independent (m111).

    PMI = log (n111/m111)"

For the trigram above, this should give:

           7073841 * 9391062 * 5872364
    m111 = ---------------------------  =  1.0968417e+12
                   355663266

and

    PMI = log (262744 / 1.0968417e+12) = -15.24452

whereas NSP's pmi, using the command line

    statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi outputfile inputfile

produces the following line for the trigram above:

    at<>det<>er<>18 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083

Not only do the figures differ, but the rankings of the trigrams also diverge. Am I doing something wrong? I program in Lisp, where the default log base is e (if it matters).

I am puzzled, among other things, by the fact that pmi.pm states that m111 is computed the way I rendered it above.
But the 3D.pm file says:

    sub computeExpectedValues {
      my ($values)=...@_;
      $m111=$n1pp*$np1p*$npp1/($nppp**2);

Does this mean that the division is really by nppp*nppp and not by nppp? (Since I do not really know Perl, I may be misunderstanding.)

Thank you in advance!

Gunn
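
To make the two readings of m111 concrete, here is a minimal Python sketch (neither the Lisp code nor NSP's Perl) that computes the expected value both ways for the example trigram: once dividing by nppp, as the pmi.pm text describes, and once dividing by nppp**2, as the quoted line from 3D.pm does. The function names `parse_cnt_line`, `expected_m111` and `pmi` are made up for illustration. The sketch assumes that the first four counts on the line are n111, n1pp, np1p and npp1, as used in the hand calculation, and it uses the natural log. It does not try to reproduce NSP's -11.5906, since NSP's log base and any further steps are not assumed here.

```python
import math

def parse_cnt_line(line):
    """Split one trigram line of the .cnt file into its tokens and integer counts."""
    *tokens, counts = line.strip().split("<>")
    return tokens, [int(c) for c in counts.split()]

def expected_m111(n1pp, np1p, npp1, nppp, power):
    """Product of the three marginals divided by nppp**power (power is 1 or 2)."""
    return n1pp * np1p * npp1 / nppp ** power

def pmi(n111, m111):
    """Natural-log PMI, matching Lisp's default log base e."""
    return math.log(n111 / m111)

nppp = 355663266  # first line of the .cnt file: total trigram count
tokens, counts = parse_cnt_line(
    "at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083"
)
# Assumption: the first four counts are n111, n1pp, np1p, npp1.
n111, n1pp, np1p, npp1 = counts[:4]

# Reading 1: divide by nppp, as in the pmi.pm description.
m111_doc = expected_m111(n1pp, np1p, npp1, nppp, power=1)
pmi_doc = pmi(n111, m111_doc)

# Reading 2: divide by nppp**2, as in the quoted line from 3D.pm.
m111_code = expected_m111(n1pp, np1p, npp1, nppp, power=2)
pmi_code = pmi(n111, m111_code)

print(tokens, n111)
print("nppp    :", m111_doc, pmi_doc)    # the hand calculation above
print("nppp**2 :", m111_code, pmi_code)
print("difference in PMI:", pmi_code - pmi_doc, "log(nppp):", math.log(nppp))
```

Under independence the expected count is nppp times the product of three marginal probabilities, (n1pp/nppp) * (np1p/nppp) * (npp1/nppp), which simplifies to the form with nppp**2 in the denominator. The two PMI values differ by exactly log(nppp), which is the same for every trigram. So this choice alone shifts all scores by a constant and would not, by itself, change the ranking.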