Hi Gunn,

You might be hitting a peculiar bug we noticed late last year (which still 
hasn't been fixed).

http://tech.groups.yahoo.com/group/ngram/message/240

If you run with just pmi on the command line, do your results agree with your 
Lisp code?

If there is still disagreement, let's run some tests on some common input and 
see if we can isolate why those differences exist...
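
As a starting point for such a test, here is a minimal Python sketch (not NSP 
code) that parses one line in the format you quote and computes PMI with the 
expected value divided either by nppp, as the pmi.pm text describes, or by 
nppp**2, as the 3D.pm line you quote does. The names parse_cnt_line and pmi, 
the squared flag, and the reading of the first four counts as n111, n1pp, np1p 
and npp1 are assumptions made for illustration. The log base is left as a 
parameter because the base NSP uses is not confirmed here. Note that under 
independence the expected count is nppp * (n1pp/nppp) * (np1p/nppp) * 
(npp1/nppp), which reduces to the nppp**2 form.

```python
import math


def parse_cnt_line(line):
    """Split a count line of the form tok1<>tok2<>tok3<>c0 c1 c2 ... into
    (tokens, counts). Assumes the counts follow the last '<>' separator."""
    *tokens, counts = line.strip().split("<>")
    return tokens, [int(c) for c in counts.split()]


def pmi(n111, n1pp, np1p, npp1, nppp, squared=True, base=math.e):
    """PMI = log(n111 / m111).

    squared=True  divides the product of marginals by nppp**2
                  (the form in the quoted 3D.pm line).
    squared=False divides by nppp (the form in the quoted pmi.pm text).
    """
    denom = nppp ** 2 if squared else nppp
    m111 = n1pp * np1p * npp1 / denom
    return math.log(n111 / m111, base)


if __name__ == "__main__":
    nppp = 355663266  # total from the first line of the .cnt file
    line = "at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083"
    tokens, counts = parse_cnt_line(line)
    # Assumption: the first four counts are n111, n1pp, np1p, npp1.
    n111, n1pp, np1p, npp1 = counts[:4]
    for squared in (False, True):
        for base in (math.e, 2):
            score = pmi(n111, n1pp, np1p, npp1, nppp, squared, base)
            print(tokens, "squared" if squared else "unsquared", base, score)
```

With squared=False and the natural log, this reproduces the -15.24452 from the 
hand calculation in your message. For a fixed sample the two variants differ by 
exactly log(nppp), a constant, so the choice of denominator alone would shift 
every score by the same amount without changing the ranking.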

To be honest I haven't looked at the PMI code in a while so I don't recall all 
the details, but I'll do that and respond in more detail. Just wanted to see if 
the above resolves anything for you.

Cordially,
Ted

--- In ngram@yahoogroups.com, "gunnlyse" <gunnl...@...> wrote:
>
> Hi,
> 
> Due to a memory problem in using the NSP package on my trigrams, I decided 
> that I would rather program the calculations myself. Not being a Perl 
> programmer, I can only assume that I have understood the code correctly.
> I started from a .cnt file. Example lines (the first line is the total n-gram 
> count, followed by one example line):
> 
> 355663266
> at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083
> 
> I based my code on what I found in the file 3D.pm (concerning estimated 
> frequencies and observed frequencies), and translated this into Lisp code. 
> Then I used the specific code for each association measure. Having done the 
> programming, I tested my code for computing scores on a small data sample of 
> 20 lines, and ran the NSP package on the same sample (NSP does not crash on 
> this small sample).
> It turns out that my values are far from similar to the ones produced by the 
> NSP package, and I see no reason why. Could anyone have a look at this?
> 
> Specifically, say that I wish to compute the pmi for the example trigram 
> above. According to the file pmi.pm:
> 
> "The expected values for the internal cells are calculated by taking the 
> product of their associated marginals and dividing by the sample size, for 
> example:
> 
>             n1pp * np1p * npp1
>    m111=   --------------------
>                    nppp
> 
> Pointwise Mutual Information (pmi) is defined as the log of the deviation
> between the observed frequency of a trigram (n111) and the probability of
> that trigram if it were independent (m111).
> 
>  PMI =   log (n111/m111)
> 
> For the trigram above, this should give:
>
>          7073841 * 9391062 * 5872364
>  m111 = -----------------------------
>                  355663266
>
>       = 1.0968417e+12
> 
> 
> and PMI = log (262744 / 1.0968417e+12) = -15.24452
> 
> whereas NSP's pmi 
> (using the command line:
> statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi outputfile inputfile)
> produces the following line for the trigram above:
> 
> at<>det<>er<>18 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083
> 
> Not only do the figures differ, but the rankings of the trigrams also diverge.
> Am I doing something wrong? I program in Lisp, where the default log base is e 
> (if it matters).
> 
> I am puzzled, among other things, by the fact that the pmi file states that 
> m111 is computed the way I rendered it above. But in the 3D.pm file it says 
> that
> 
> "sub computeExpectedValues
> {
>   my ($values)=...@_;
> 
> $m111=$n1pp*$np1p*$npp1/($nppp**2); "
> 
> Does this mean that we really divide, not by 
> nppp
> but by 
> nppp*nppp? 
> (Since I do not really know Perl, I may be misunderstanding.)
> 
> 
> 
> Thank you in advance!
> Gunn
>

