Hi everyone!

> The NSP values still do not match mine, and I see that it concerns ll, pmi,
> ps as well as tmi for trigrams. Evidently, there must be some error which
> probably lies in the observed or estimated frequencies (since all four
> measures produce different results than mine)
>
> I need to ask for two clarifications:
> (1) estimated frequency: The webpage/pmi file says:
>
>           n1pp * np1p * npp1
>   m111 = --------------------
>                  nppp
>
> but the file 3D.pm says
>
>   $m111=$n1pp*$np1p*$npp1/($nppp**2);
>
> which I take to mean that we use, not nppp, but the exponent:
>
>           n1pp * np1p * npp1
>   m111 = --------------------
>              nppp * nppp
>
> If so, which one should I really use?
The correct expected co-occurrence frequency under an independence
hypothesis is the second one, with the denominator squared. It's easy to
make this clear to yourself if you keep its mathematical derivation in mind:

 - The occurrence probability of the first word is (n1pp/nppp); of the
   second word (np1p/nppp); of the third word (npp1/nppp).

 - The probability of all three words occurring next to each other by
   chance, i.e. the co-occurrence probability under an independence null
   hypothesis, is the product of the three probabilities:

     (n1pp/nppp)*(np1p/nppp)*(npp1/nppp) = n1pp * np1p * npp1 / (nppp**3)

 - Multiply this probability by the sample size nppp to obtain the expected
   frequency under the independence null:

     m111 = n1pp * np1p * npp1 / (nppp**2)

> (2) Furthermore, let us return to the example trigram. When I compute the
> example trigram's pmi in the way I understand the code, I get the value
> -15.24452, instead of the NSP package's 6.4127.

Not surprising: your expected frequency is way too high (by a factor of
nppp), so you have a lower co-occurrence frequency than expected and hence
a negative association.

The standard definition of PMI uses base-2 logarithms (because of its roots
in information theory), so the resulting value can be interpreted as "bits
of mutual information". Other implementations diverge from this; e.g., for
my own code in the UCS toolkit I made the regrettable decision to use
base-10 logarithms. Note that all versions should still give the same
ranking of candidates, since changing the base of the logarithm only
rescales the scores by a positive constant, so comparing rankings is a
"robust" test case.

Cheers,
Stefan (Evert)
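
For readers who want to check the arithmetic, here is a minimal sketch of the derivation above in Python. It is not the NSP or UCS code: the function names are made up for illustration, and the counts are hypothetical values, not the example trigram from this thread. It only illustrates that dropping the square in the denominator inflates the expected frequency by a factor of nppp and can turn a positive PMI into a negative one.

```python
import math

def expected_trigram_freq(n1pp, np1p, npp1, nppp):
    """Expected trigram frequency under independence:
    nppp * (n1pp/nppp) * (np1p/nppp) * (npp1/nppp)."""
    return n1pp * np1p * npp1 / nppp**2

def pmi(n111, m111, base=2.0):
    """Pointwise mutual information: log(observed / expected)."""
    return math.log(n111 / m111, base)

# Hypothetical counts, invented for illustration only.
n111 = 10                          # observed trigram frequency
n1pp, np1p, npp1 = 200, 300, 400   # marginal counts for positions 1, 2, 3
nppp = 1_000_000                   # total number of trigrams (sample size)

m111 = expected_trigram_freq(n1pp, np1p, npp1, nppp)
m111_wrong = n1pp * np1p * npp1 / nppp   # denominator NOT squared

pmi_ok = pmi(n111, m111)              # positive: observed >> expected
pmi_wrong = pmi(n111, m111_wrong)     # negative: expected is inflated
pmi_base10 = pmi(n111, m111, base=10.0)

print(f"m111 = {m111:.3g}, wrong m111 = {m111_wrong:.3g}")
print(f"PMI (base 2) = {pmi_ok:.4f}, with wrong m111 = {pmi_wrong:.4f}")
print(f"PMI (base 10) = {pmi_base10:.4f}")
```

With these made-up counts, the two base-2 values differ by exactly log2(nppp), and the base-10 score is the base-2 score divided by log2(10), which is why the choice of base does not change the ranking of candidates.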