Greetings all,

Thanks to Stefan for the very complete and lucid explanation of
computing PMI scores. We'll be updating the documentation to reflect
our actual calculation (which we are relieved to find appears to be
correct), and also fixing the issue with the "long form" of the
measure not working properly from the command line.

As to the question of which log we are using: we use the Perl log(x)
function, which returns the natural log (base e). So, absolute values
may well differ between systems depending on the log base used, but
the relative ranking should be the same. This is a further concern
when thinking about cutoffs (what value of PMI indicates that I've
found a collocation, for example...), since one person might report a
value computed with the natural log while someone else reports a value
computed with base 2 or base 10. So, just something to be careful of,
perhaps...
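
For anyone comparing scores across implementations, here is a quick
Perl sketch of the base conversion (the starting value is just an
illustrative placeholder, not output from NSP):

    # Perl's log() returns the natural log (base e). To convert a
    # natural-log score to another base, divide by the natural log
    # of the target base.
    my $pmi_ln = 3.21;                    # hypothetical PMI score, base e
    my $pmi_base2  = $pmi_ln / log(2);    # same score in bits (base 2)
    my $pmi_base10 = $pmi_ln / log(10);   # same score in base 10
    printf "base e: %.4f  base 2: %.4f  base 10: %.4f\n",
        $pmi_ln, $pmi_base2, $pmi_base10;

Since the conversion is just division by a positive constant, the
ranking of candidates is unchanged; only cutoff values need to be
rescaled.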

Cordially,
Ted

On Sun, Jun 14, 2009 at 10:19 AM, Stefan
Evert <ev...@ims.uni-stuttgart.de> wrote:
>
>
>
> Hi everyone!
>
>> The NSP values still do not match mine, and I see that this concerns ll,
>> pmi, and ps, as well as tmi for trigrams. Evidently, there must be some error,
>> which probably lies in the observed or estimated frequencies (since all four
>> measures produce different results from mine).
>>
>> I need to ask for two clarifications:
>> (1) estimated frequency: The webpage/pmi file says:
>>
>>            n1pp * np1p * npp1
>>    m111 = --------------------
>>                  nppp
>>
>> but the file 3D.pm says:
>>
>>    $m111=$n1pp*$np1p*$npp1/($nppp**2);
>>
>> which I take to mean that we use not nppp but nppp squared:
>>
>>            n1pp * np1p * npp1
>>    m111 = --------------------
>>               nppp * nppp
>>
>> If so, which one should I really use?
>
> The correct expected co-occurrence frequency under an independence
> hypothesis is the second one, with the denominator squared. It's easy
> to see why if you keep the mathematical derivation in mind:
>
> - The occurrence probability of the first word is (n1pp/nppp); that
> of the second word is (np1p/nppp); and so on.
>
> - The probability of all three words occurring next to each other by
> chance, i.e. the co-occurrence probability under an independence null
> hypothesis, is the product of the three probabilities:
> (n1pp/nppp)*(np1p/nppp)*(npp1/nppp) = n1pp * np1p * npp1 / (nppp**3)
>
> - Multiply this probability by the sample size nppp to obtain the
> expected frequency under the independence null hypothesis:
> n1pp * np1p * npp1 / (nppp**2), exactly the formula in 3D.pm
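>
> A minimal Perl sketch of this derivation (the variable names follow
> the 3D.pm snippet quoted above; the counts are made-up examples):
>
>     # marginal counts and sample size (hypothetical values)
>     my $n1pp = 50;       # frequency of word 1 in position 1
>     my $np1p = 40;       # frequency of word 2 in position 2
>     my $npp1 = 30;       # frequency of word 3 in position 3
>     my $nppp = 100000;   # total number of trigrams (sample size)
>
>     # co-occurrence probability under the independence null hypothesis
>     my $p_indep = ($n1pp / $nppp) * ($np1p / $nppp) * ($npp1 / $nppp);
>
>     # multiply by the sample size to get the expected frequency;
>     # equivalent to $n1pp * $np1p * $npp1 / ($nppp ** 2)
>     my $m111 = $p_indep * $nppp;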
>
>> (2) Furthermore, let us return to the example trigram. When I
>> compute the example trigram's pmi in the way I understand the code, I
>> get the value -15.24452, instead of the NSP package's 6.4127.
>
> Not surprising: your expected frequency is way too high (by a factor
> of nppp), so you have a lower co-occurrence frequency than expected
> and hence a negative association.
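>
> To see the size of the error: writing n111 for the observed trigram
> frequency, PMI is log(n111 / m111), so an expected frequency that is
> too large by a factor of nppp gives
>
>     log(n111 / (m111 * nppp)) = log(n111 / m111) - log(nppp)
>
> i.e. the correct PMI shifted down by the constant log(nppp), which is
> easily enough to turn a positive score negative.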
>
> The standard definition of PMI uses base-2 logarithms (because of its
> roots in information theory), so the resulting value can be
> interpreted as "bits of mutual information". Other implementations
> diverge from this; e.g., for my own code in the UCS toolkit I made the
> regrettable decision to use base-10 logarithms. Note that all
> versions should still give the same ranking of candidates, so that's a
> "robust" test case.
>
> Cheers,
> Stefan (Evert)
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
