Hello, thank you for this clarification! The NSP values still do not match mine, and I see that this concerns ll, pmi, ps as well as tmi for trigrams. Since all four measures give results that differ from mine, the error probably lies in the observed or the estimated frequencies.
I need to ask for two clarifications:

(1) Estimated frequency. The webpage/pmi file says:

            n1pp * np1p * npp1
    m111 = --------------------
                   nppp

but the file 3D.pm says

    $m111=$n1pp*$np1p*$npp1/($nppp**2);

which I take to mean that we divide, not by nppp, but by nppp squared:

            n1pp * np1p * npp1
    m111 = --------------------
               nppp * nppp

If so, which one should I really use?

(2) Furthermore, let us return to the example trigram. When I compute the example trigram's pmi in the way I understand the code, I get the value -15.24452 instead of the NSP package's 6.4127. All the observed frequencies needed for pmi are directly available in the example trigram line, so the only thing that can explain the diverging results is HOW we compute the value. May I therefore ask if you agree with the way I understand the code?

For the trigram

    355663266
    at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083

I compute m111 as:

> > m111 = 7073841 * 9391062 * 5872364
> >        ---------------------------
> >                 355663266
> >
> >      = 1.0968417e+12
> >
> > and PMI = log (262744 / 1.0968417e+12) = -15.24452
> >
> > NSP's pmi (using the command line:
> > statistic.pl --ngram 3 pmi outputfile inputfile )
> > produces the following line:

    at<>det<>er<>1 6.4127 262744 7073841 9391062 5872364 1234064 647295 1064083

Best,
Gunn
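
P.S. To make the two readings of m111 easy to compare, here is a small Python sketch of the arithmetic. It is only my own check, not code from the NSP package. The variable names mirror the ones in 3D.pm; the natural log in the first calculation is what I used myself, and the base-2 log in the second is an assumption on my part about what the package does.

```python
import math

# Counts copied from the example trigram line at<>det<>er<>
n111 = 262744      # observed frequency of the trigram
n1pp = 7073841     # marginal counts, in the order they appear on the line
np1p = 9391062
npp1 = 5872364
nppp = 355663266   # total number of trigrams (first line of the count file)

product = n1pp * np1p * npp1

# Reading 1: divide by nppp once, as on the webpage
m111_single = product / nppp
# Reading 2: divide by nppp squared, as in 3D.pm
m111_squared = product / nppp**2

# My own calculation: natural log, single nppp (approximately -15.2445)
pmi_single_ln = math.log(n111 / m111_single)
# Assumed variant: base-2 log, nppp squared (approximately 6.4127)
pmi_squared_log2 = math.log2(n111 / m111_squared)

print(f"m111 (nppp)    = {m111_single:.7e}, ln PMI   = {pmi_single_ln:.5f}")
print(f"m111 (nppp**2) = {m111_squared:.4f}, log2 PMI = {pmi_squared_log2:.4f}")
```

With these counts the first variant reproduces my -15.24452, and the second comes out at about 6.4127, the value NSP reports. That would fit the usual independence estimate nppp * (n1pp/nppp) * (np1p/nppp) * (npp1/nppp), which simplifies to the nppp**2 form, but I would still like to have this confirmed.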