Re: [ngram] btw

Kohli Saiyam Thu, 11 Jun 2009 08:29:11 -0700

Sir,

There is a bug in statistic.pl that causes the values to differ between the
long and short measure names.

The following piece of code is not executed when the extended pmi name is
used. I will send you the fixed statistic.pl in a couple of hours. I am also
looking into other issues, but that might take a bit longer.

if($statistic eq 'pmi')
{
  if(defined $opt_pmi_exp)
  {
    initializeStatistic($opt_pmi_exp);
  }
}

Regards,

Saiyam

________________________________
From: Ted Pedersen <duluth...@gmail.com>
To: ngram@yahoogroups.com
Sent: Thursday, June 11, 2009 8:10:37 AM
Subject: Re: [ngram] btw

Hi Gunn,

I understand all your concerns, and certainly share them. I'm afraid I
have the added problem of needing to figure out why NSP produces two
different values for 3-grams (when given the complete name of the
measure versus the abbreviated form). That makes it very hard for me
to comment on your observation about the 3D.pm module, simply because
I'm wondering if that code is even being called when we use the short
form of the name (just pmi, etc.). So, we actually have a case here
where there are three different values being produced (yours is one,
and then NSP has different values for pmi versus
Text::NSP::Measures::3D::MI::pmi).

But, from your notes it seems there are two questions - one of them is
whether our formulation of the expected value calculation is correct -
in particular with regard to squaring the sample size (as found in
3D.pm). I don't honestly recall the details of that calculation, but
I'll be checking that. The other question is whether or not we are
using the natural log - again, I'll need to check on that.
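
For reference, a sketch of the calculation in question, assuming the usual
full-independence model for a trigram. The marginal names n1pp, np1p, npp1
(counts of each word in its position) and nppp (the sample size) are notation
assumed for this note and should be checked against 3D.pm, not taken as its
definition:

```latex
m_{111} = n_{ppp}\cdot\frac{n_{1pp}}{n_{ppp}}\cdot\frac{n_{p1p}}{n_{ppp}}\cdot\frac{n_{pp1}}{n_{ppp}}
        = \frac{n_{1pp}\, n_{p1p}\, n_{pp1}}{n_{ppp}^{2}},
\qquad
\mathrm{PMI} = \log_b \frac{n_{111}}{m_{111}}
```

Under that assumption the squared sample size simply falls out of multiplying
three marginal probabilities by the sample size. Note also that for any base
b > 1 the logarithm is positive exactly when n111 > m111, so switching between
base 2 and base e rescales the value but cannot change its sign; a sign
difference would have to come from the expected value or the counts.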

Also, I'm going to post a short test case in another note; it would be
helpful to have your results on that too...

Thanks!
Ted

On Thu, Jun 11, 2009 at 6:48 AM, gunnlyse <gunnl...@yahoo.no> wrote:
>
>
> ..I would like to add that I am not trying to pester you, it's fine by me if
> the error is in my code. But I do this as part of a larger corpus project,
> the Norwegian newspaper corpus, and we intend to publish these results.
> Therefore, I am eager to double-check that the values are correct, and I
> also wanted to make the point (in our article) that we have used the NSP
> package as our recipe (since you provide a standard for everyone to use).
> But then it is slightly awkward if the values do not at all resemble the NSP
> output;)
>
> Therefore, if you have the time: could you please look into whether you
> agree with the way I understand the formulae that I find in NSP? It should
> suffice to look at the one-line example that I posted a few minutes ago.
> Since this calculation only uses counts that are directly retrieved from the
> input line, plus m111, which it needs to calculate, the difference must lie in
> how we use these counts. Could you for instance show the computation for
> this concrete trigram the way you(r program) would do it, so that I see the
> difference?
> Can it be, for instance, that your system uses another log base than e? It
> is very awkward that your program produces a positive value for the example
> trigram whereas my program (which computes the same value as the one I
> computed manually in the recently posted mail) produces a negative value.
>
> Thank you for all your help thus far.
>
> Best,
> Gunn
>
> 

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
