Re: [ngram] btw

Ted Pedersen Thu, 11 Jun 2009 13:33:25 -0700

Hi Gunn,

Regarding the calculation of the expected values - the most complete
documentation we have about that is found in Saiyam Kohli's MS
project, see pages 16-26 roughly speaking.


http://www.d.umn.edu/~tpederse/Pubs/saiyam-report.pdf

The short version is that I think squaring the nppp term is
appropriate, given the underlying model (independence) against which we
compare the observed values. So I think the comments in the code are
inaccurate, but the calculation itself is correct (assuming that you
use the short form 'pmi' from the command line...)
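
To make the squaring concrete, here is a minimal sketch in Python (not
NSP's actual Perl code). It assumes the usual count names: n111 is the
observed trigram count, n1pp, np1p and npp1 are the counts of each word
in its position, and nppp is the total number of trigrams. The function
names and the example counts are hypothetical, and the log base is left
as a parameter (defaulting to 2 purely as an assumption), since which
base NSP uses is still an open question in this thread.

```python
import math


def expected_trigram_count(n1pp, np1p, npp1, nppp):
    """Expected count m111 if the three words were fully independent.

    m111 = nppp * (n1pp/nppp) * (np1p/nppp) * (npp1/nppp)
         = n1pp * np1p * npp1 / nppp**2
    The sample size appears squared because three probabilities are
    multiplied together and then scaled back up by nppp once.
    """
    return (n1pp * np1p * npp1) / (nppp ** 2)


def trigram_pmi(n111, n1pp, np1p, npp1, nppp, base=2.0):
    """PMI as the log of observed over expected; base is an assumption."""
    m111 = expected_trigram_count(n1pp, np1p, npp1, nppp)
    return math.log(n111 / m111, base)


# Illustrative counts only; they are not taken from this thread.
m111 = expected_trigram_count(100, 200, 50, 10000)
score = trigram_pmi(10, 100, 200, 50, 10000)
print(m111, score)
```

With these made-up counts m111 is 0.01, so observed/expected is 1000 and
the score is positive (about 9.97 in base 2). Dividing by nppp only once
would instead give an expected value of 100 and a ratio of 0.1, whose log
is negative. Changing the log base only rescales the score and cannot
change its sign.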

More details as they emerge here, but just wanted to point you to this...

Cordially,
Ted

On Thu, Jun 11, 2009 at 8:10 AM, Ted Pedersen<duluth...@gmail.com> wrote:
> Hi Gunn,
>
> I understand all your concerns, and certainly share them. I'm afraid I
> have the added problem of needing to figure out why NSP produces two
> different values for 3-grams (when given the complete name of the
> measure versus the abbreviated form). That makes it very hard for me
> to comment on your observation about the 3D.pm module, simply because
> I'm wondering if that code is even being called when we use the short
> form of the name (just pmi, etc.). So, we actually have a case here
> where there are three different values being produced (yours is one,
> and then NSP has different values for pmi versus
> Text::NSP::Measures::3D::MI::pmi).
>
> But, from your notes it seems there are two questions - one of them is
> whether our formulation of the expected value calculation is correct -
> in particular with regard to squaring the sample size (as found in
> 3D.pm). I don't honestly recall the details of that calculation, but
> I'll be checking that. The other question is whether or not we are
> using the natural log - again, I'll need to check on that.
>
> Also, I'm going to post a short test case in another note; it would be
> helpful to have your results on that too...
>
> Thanks!
> Ted
>
> On Thu, Jun 11, 2009 at 6:48 AM, gunnlyse<gunnl...@yahoo.no> wrote:
>>
>>
>> ..I would like to add that I am not trying to pester you, it's fine by me if
>> the error is in my code. But I do this as part of a larger corpus project,
>> the Norwegian newspaper corpus, and we intend to publish these results.
>> Therefore, I am eager to double-check that the values are correct, and I
>> also wanted to make the point (in our article) that we have used the NSP
>> package as our recipe (since you provide a standard for everyone to use).
>> But then it is slightly awkward if the values do not at all resemble the NSP
>> output;)
>>
>> Therefore, if you have the time: could you please look into whether you
>> agree with the way I understand the formulae that I find in NSP? It should
>> suffice to look at the one-line example that I posted a few minutes ago.
>> Since this calculation only uses counts that are directly retrieved from the
>> input line, plus it needs to calculate the m111, the difference must lie in
>> how we use these counts. Could you for instance show the computation for
>> this concrete trigram the way you(r program) would do it, so that I see the
>> difference?
>> Can it be, for instance, that your system uses another log base than e? It
>> is very awkward that your program produces a positive value for the example
>> trigram whereas my program (which computes the same value as the one I
>> computed manually in the recently posted mail) produces a negative value.
>>
>> Thank you for all your help thus far.
>>
>> Best,
>> Gunn
>>
>> 
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
