[ngram] the NSP trigram calculations don't match mine??

gunnlyse Wed, 10 Jun 2009 16:53:01 -0700

Hi,

Because of a memory problem when running the NSP package on my trigrams, I 
decided to program the calculations myself. Not being a Perl programmer, I 
can only assume that I have understood the code correctly.
I started from a .cnt file. Example lines (the first line is the total n-gram 
count, followed by one example line):


355663266
at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083
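
For readers who want to reproduce the numbers, here is a minimal Python sketch of reading such a line. The helper name `parse_cnt_line` and the `TrigramCounts` type are hypothetical. The first four count fields are read as n111, n1pp, np1p and npp1, which matches how they are used later in this post; reading the last three as n11p, n1p1 and np11 is an assumption and should be checked against the NSP documentation.

```python
from typing import NamedTuple, Tuple


class TrigramCounts(NamedTuple):
    tokens: Tuple[str, ...]
    n111: int  # joint frequency of the trigram
    n1pp: int  # marginal: first token in position 1
    np1p: int  # marginal: second token in position 2
    npp1: int  # marginal: third token in position 3
    n11p: int  # ASSUMED meaning: tokens 1 and 2 together
    n1p1: int  # ASSUMED meaning: tokens 1 and 3 together
    np11: int  # ASSUMED meaning: tokens 2 and 3 together


def parse_cnt_line(line: str) -> TrigramCounts:
    """Split 'w1<>w2<>w3<>c1 c2 ... c7' into tokens and integer counts."""
    *tokens, counts = line.strip().split("<>")
    values = [int(v) for v in counts.split()]
    if len(tokens) != 3 or len(values) != 7:
        raise ValueError(f"unexpected trigram line: {line!r}")
    return TrigramCounts(tuple(tokens), *values)


nppp = int("355663266")  # first line of the .cnt file: total trigram count
row = parse_cnt_line(
    "at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083"
)
print(row.tokens, row.n111, row.n1pp, row.np1p, row.npp1)
```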

I based my work on what I found in the file 3D.pm (concerning estimated 
frequencies and observed frequencies), and translated this into Lisp code. Then 
I used the specific code for each association measure. Having done the 
programming, I tested my code for computing scores on a small data sample of 20 
lines, and ran the NSP package on the same sample (NSP does not crash on this 
small sample).
It turns out that my values are far from the ones produced by the NSP 
package, and I see no reason why. Could anyone have a look at this?

Specifically, say that I wish to compute the pmi for the example trigram above. 
According to the file pmi.pm:

"The expected values for the internal cells are calculated by taking the 
product of their associated marginals and dividing by the sample size, for 
example:

            n1pp * np1p * npp1
   m111=   --------------------
                   nppp

Pointwise Mutual Information (pmi) is defined as the log of the deviation
between the observed frequency of a trigram (n111) and the probability of
that trigram if it were independent (m111).

 PMI =   log (n111/m111)"

For the trigram above, this should give:

         7073841 * 9391062 * 5872364
 m111 = -----------------------------
                 355663266

      = 1.0968417e+12


and PMI = log (262744 / 1.0968417e+12) = -15.24452
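
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as described in this post (a single division by nppp and a natural logarithm). It is not NSP's own code, and the variable names are chosen here for illustration.

```python
import math

# Counts for the trigram at<>det<>er, taken from the .cnt line above.
n111, n1pp, np1p, npp1 = 262744, 7073841, 9391062, 5872364
nppp = 355663266  # total trigram count

# Expected value as worded in the pmi.pm documentation: divide by nppp once.
m111_doc = n1pp * np1p * npp1 / nppp

# Natural log, matching the Lisp default mentioned in this post.
pmi_doc = math.log(n111 / m111_doc)

print(f"m111 = {m111_doc:.7e}")
print(f"PMI  = {pmi_doc:.5f}")
```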

whereas NSP's pmi 
(using the command line:
statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi outputfile inputfile)
produces the following line for the trigram above:

at<>det<>er<>18 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083

Not only do the figures differ, but the rankings of the trigrams also diverge.
Am I doing something wrong? I program in Lisp, where the default log base is e 
(if it matters).
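
On the log base: a change of base only multiplies every score by the same positive constant, so on its own it cannot change the ranking. The sketch below converts the natural-log score from this post to base 2. Whether NSP actually reports base-2 values is an assumption made here for illustration, not something established in this post.

```python
import math

pmi_natural = -15.24452  # score computed above with the natural log

# log2(x) = ln(x) / ln(2); base 2 is an ASSUMED candidate for NSP's log base.
pmi_base2 = pmi_natural / math.log(2)

print(f"base e: {pmi_natural:.4f}")
print(f"base 2: {pmi_base2:.4f}")
```

The converted value is still not the -11.5906 reported by NSP, so the base alone does not seem to account for the difference.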

I am also puzzled by the fact that the pmi.pm file states that m111 is 
computed the way I rendered it above, whereas the 3D.pm file contains:

"sub computeExpectedValues
{
  my ($values)=...@_;

$m111=$n1pp*$np1p*$npp1/($nppp**2); "

Does this mean that the product is really divided, not by 
nppp
but by 
nppp*nppp ? 
(Since I do not really know Perl, I may be misunderstanding.)
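
For reference, `**` in Perl is the exponentiation operator, so `$nppp**2` is nppp squared. Under independence the expected count is nppp times the product of the three marginal probabilities, which works out to n1pp*np1p*npp1/nppp^2. The sketch below compares the two denominators for the example trigram. The function name `expected_m111` is hypothetical, and this is a check of the arithmetic only, not a reproduction of NSP's code path.

```python
import math

n111, n1pp, np1p, npp1 = 262744, 7073841, 9391062, 5872364
nppp = 355663266


def expected_m111(n1pp: int, np1p: int, npp1: int, nppp: int, power: int) -> float:
    """Product of the three marginals divided by nppp raised to `power`."""
    return n1pp * np1p * npp1 / nppp**power


m111_single = expected_m111(n1pp, np1p, npp1, nppp, power=1)  # pmi.pm wording
m111_square = expected_m111(n1pp, np1p, npp1, nppp, power=2)  # 3D.pm code

for label, m111 in (("nppp", m111_single), ("nppp**2", m111_square)):
    # An expected count larger than the whole sample cannot be right.
    plausible = m111 <= nppp
    print(f"/{label:8s} m111 = {m111:.6e}  plausible = {plausible}  "
          f"ln(n111/m111) = {math.log(n111 / m111):.4f}")
```

Dividing by nppp only once gives an expected count that is larger than the total number of trigrams in the sample, which suggests that the formula in the pmi.pm documentation is missing a square and that the 3D.pm code is the intended calculation.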



Thank you in advance!
Gunn
