Hi everyone!

> The NSP values still do not match mine, and I see that it concerns ll, pmi, 
> ps as well as tmi for trigrams. Evidently, there must be some error which 
> probably lies in the observed or estimated frequencies (since all four 
> measures produce different results than mine)
> 
> I need to ask for two clarifications:
> (1) estimated frequency: The webpage/pmi file says:
>                 n1pp * np1p * npp1
>    m111= --------------------
>                 nppp
> but the file 3D.pm says 
>   $m111=$n1pp*$np1p*$npp1/($nppp**2); "
> 
>  which I take to mean that we use, not nppp, but the exponent:
>                 n1pp * np1p * npp1
>  m111= --------------------
>                 nppp * nppp
> If so, which one should I really use?

The correct expected co-occurrence frequency under an independence
hypothesis is the second one, with the denominator squared.  This is
easy to see if you keep its mathematical derivation in mind:

- The occurrence probability of the first word is (n1pp/nppp); of the
second word (np1p/nppp); etc.

- The probability of all three words occurring next to each other by
chance, i.e. the co-occurrence probability under an independence null
hypothesis, is the product of the three probabilities:
(n1pp/nppp)*(np1p/nppp)*(npp1/nppp) = n1pp * np1p * npp1 / (nppp**3)

- Multiply this probability by the sample size nppp to obtain the
expected frequency under the independence null:
m111 = n1pp * np1p * npp1 / (nppp**2)
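
The derivation above can be sketched in a few lines of Python.  The
counts are made up purely for illustration (they are not taken from
your data), and the function name is my own, not part of NSP:

```python
def expected_trigram_frequency(n1pp, np1p, npp1, nppp):
    """Expected trigram frequency m111 under the independence null."""
    # Chance probability of the trigram: product of the three
    # marginal occurrence probabilities.
    p_chance = (n1pp / nppp) * (np1p / nppp) * (npp1 / nppp)
    # Multiply by the sample size to turn the probability into an
    # expected frequency.
    return p_chance * nppp


# Hypothetical marginal counts and sample size, for illustration only.
n1pp, np1p, npp1, nppp = 200, 500, 400, 1_000_000

m111 = expected_trigram_frequency(n1pp, np1p, npp1, nppp)
m111_single_nppp = n1pp * np1p * npp1 / nppp  # only one nppp in the denominator

print(m111)              # same as n1pp * np1p * npp1 / nppp**2
print(m111_single_nppp)  # larger than m111 by a factor of nppp
```

With a single nppp in the denominator the result is nppp times larger
than the value obtained from the derivation.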


> (2) Furthermore, let us return to the example trigram. When I
> compute the example trigram's pmi in the way I understand the code, I
> get the value -15.24452, instead of the NSP package's 6.4127.

Not surprising: your expected frequency is way too high (by a factor
of nppp), so you have a lower co-occurrence frequency than expected
and hence a negative association.
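
A minimal sketch of this effect, assuming PMI is computed as the
logarithm of observed over expected frequency.  All counts are
hypothetical and chosen only to make the sign flip visible; they are
not your trigram's counts:

```python
import math


def pmi(observed, expected, base=2):
    """Pointwise MI as log(observed / expected) in the given base."""
    return math.log(observed / expected, base)


# Hypothetical counts, for illustration only.
n111 = 30                                  # observed trigram frequency
n1pp, np1p, npp1, nppp = 200, 500, 400, 1_000_000

m_correct = n1pp * np1p * npp1 / nppp**2   # denominator squared
m_too_high = n1pp * np1p * npp1 / nppp     # too high by a factor of nppp

pmi_correct = pmi(n111, m_correct)
pmi_wrong = pmi(n111, m_too_high)

print(pmi_correct > 0)   # observed exceeds expected: positive association
print(pmi_wrong < 0)     # observed below inflated expectation: negative
# The two scores differ by exactly log2(nppp).
print(math.isclose(pmi_correct - pmi_wrong, math.log2(nppp)))
```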


The standard definition of PMI uses base-2 logarithms (because of its
roots in information theory), so the resulting value can be
interpreted as "bits of mutual information".  Other implementations
diverge from this; e.g., for my own code in the UCS toolkit I made the
regrettable decision to use base-10 logarithms.  Note that all
versions should still give the same ranking of candidates, so that's a
"robust" test case.

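The ranking claim is easy to check for yourself.  Here is a small
sketch with invented (observed, expected) pairs; the candidate names
and numbers are placeholders, not output from NSP or UCS:

```python
import math

# Hypothetical candidates: name -> (observed frequency, expected frequency).
candidates = {
    "cand_a": (30, 0.5),
    "cand_b": (12, 4.0),
    "cand_c": (7, 0.01),
    "cand_d": (3, 9.0),
}


def ranking(base):
    """Candidate names sorted by PMI (log of observed/expected), best first."""
    scores = {name: math.log(o / e, base) for name, (o, e) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


rank_base2 = ranking(2)        # "bits"
rank_natural = ranking(math.e)
rank_base10 = ranking(10)

print(rank_base2)
print(rank_base2 == rank_natural == rank_base10)
```

Changing the base only rescales every score by the same positive
constant, so the order of the candidates cannot change.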

Cheers,
Stefan (Evert)
