Re: [ngram] the NSP trigram calculations don't match mine??

Ted Pedersen Wed, 10 Jun 2009 18:00:10 -0700

BTW, when I ran your input file (calling it 'in'):

355663266
at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083


marimba(29): statistic.pl --ngram 3 pmi test.out in

I got the following output...

marimba(30): more test.out
355663266
at<>det<>er<>1 6.4127 262744 7073841 9391062 5872364 1234064 647295 1064083
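
The 6.4127 score can be checked by hand. The following is a minimal Python sketch, not NSP code. It assumes the count line lists n111, n1pp, np1p, npp1 first (followed by the pair counts), that the expected value divides by nppp squared as in the 3D.pm line quoted below, and that the log is base 2. The function name pmi_trigram is made up for illustration.

```python
import math

def pmi_trigram(n111, n1pp, np1p, npp1, nppp):
    """PMI of a trigram as log2(observed / expected)."""
    # Expected count under independence:
    # nppp * (n1pp/nppp) * (np1p/nppp) * (npp1/nppp)
    m111 = n1pp * np1p * npp1 / nppp ** 2
    return math.log2(n111 / m111)

# Counts taken from the example line for at<>det<>er<>
score = pmi_trigram(262744, 7073841, 9391062, 5872364, 355663266)
print(f"{score:.4f}")  # agrees with the 6.4127 reported by the short-name run
```

Under these assumptions m111 is about 3084, not the 1.0968417e+12 obtained by dividing by nppp only once, and the natural log would give about 4.445 instead of 6.4127.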

However, when I run it using the full module name, things don't look so good...

marimba(35): statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi test.out in
Use of uninitialized value in exponentiation (**) at
/usr/local/share/perl/5.8.8/Text/NSP/Measures/3D/MI/pmi.pm line 113,
<SRC> line 2.

I get something like what you are getting...

marimba(37): more test.out
355663266
at<>det<>er<>1 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083

So, I think the bug in handling the full form of the measure name is the
root of the problem here. The simple workaround is to use the short form
(pmi), although clearly this needs to be fixed...

Thanks,
Ted

On Wed, Jun 10, 2009 at 3:00 PM, gunnlyse<gunnl...@yahoo.no> wrote:
>
>
> Hi,
>
> Due to a memory problem when using the NSP package on my trigrams, I decided
> that I would rather program the calculations myself. Not being a Perl
> programmer, I can only assume that I have understood the code correctly.
> I started from a .cnt file. Example lines follow (the first line is the total
> n-gram count, followed by one example line):
>
> 355663266
> at<>det<>er<>262744 7073841 9391062 5872364 1234064 647295 1064083
>
> I relied on what I found in the file 3D.pm (concerning estimated
> frequencies and observed frequencies), and translated this into Lisp code.
> Then I used the specific code for each association measure. Having done the
> programming, I tested my code for computing scores on a small data sample of
> 20 lines, and ran the NSP package on the same sample (NSP does not crash on
> this small sample).
> It turns out that my values are far from similar to the ones produced by the
> NSP package, and I see no reason why. Could anyone have a look at this?
>
> Specifically, say that I wish to compute the pmi for the example trigram
> above. According to the file pmi.pm:
>
> "The expected values for the internal cells are calculated by taking the
> product of their associated marginals and dividing by the sample size, for
> example:
>
>               n1pp * np1p * npp1
>        m111 = ------------------
>                      nppp
>
> Pointwise Mutual Information (pmi) is defined as the log of the deviation
> between the observed frequency of a trigram (n111) and the probability of
> that trigram if it were independent (m111).
>
> PMI = log (n111/m111)"
>
> For the trigram above, this should give:
>        m111 = 7073841 * 9391062 * 5872364
>               ---------------------------
>                        355663266
>
> = 1.0968417e+12
>
> and PMI = log (262744 / 1.0968417e+12) = -15.24452
>
> whereas NSP's pmi
> (using the command line:
> statistic.pl --ngram 3 Text::NSP::Measures::3D::MI::pmi outputfile
> inputfile)
> produces the following line for the trigram above:
>
> at<>det<>er<>18 -11.5906 262744 7073841 9391062 5872364 1234064 647295 1064083
>
> Not only do the figures differ, but the rankings of the trigrams also diverge.
> Am I doing something wrong? I program in Lisp, where the default log base is
> e (if it matters).
>
> I am puzzled, among other things, by the fact that the pmi file states that
> m111 is computed the way I rendered it above. But in the 3D.pm file it says
> that
>
> "sub computeExpectedValues
> {
> my ($values)=...@_;
>
> $m111=$n1pp*$np1p*$npp1/($nppp**2); "
>
> Does this mean that we actually divide, not by
> nppp
> but by
> nppp*nppp ?
> (Since I do not really know Perl, maybe I misunderstand.)
>
> Thank you in advance!
> Gunn
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
