Re: [ngram] Select an score

Ted Pedersen Fri, 17 Apr 2009 16:28:07 -0700

Greetings Merce,

Our FAQ tries to provide a little guidance on this issue...

http://search.cpan.org/dist/Text-NSP/doc/FAQ.pod

The short answer though is that there probably isn't a single measure
that is always the "best" choice. Worse yet, in general there are not
any clear "cutoffs" for any of the measures as to where you find a
boundary between meaningful associations and spurious ones. Even when
using p-values (as in Fisher's Exact test) you can set cutoffs of .01,
.05, .1, .001, .005, and so on with equal validity....
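
One way to see that the measures do not agree is to line up the ranks
they assign to the same bigram. The following is only a rough Python
sketch, not part of NSP: the input lines are copied from the quoted
message below, the field layout (word1<>word2<>rank, score, then three
counts) is assumed from those lines, and the names parse_line and
ranks_by_bigram are made up for illustration.

```python
# Output lines copied from the quoted message, grouped by measure.
# Assumed layout: word1<>word2<>rank score count count count
OUTPUT = {
    "TMI": [
        "earth<>station<>1 0.0205 1375 2249 2598",
        "signal<>unit<>5 0.0102 958 5446 1900",
    ],
    "Left": [
        "earth<>station<>1 1.0000 1375 2249 2598",
        "signal<>unit<>1 1.0000 958 5446 1900",
    ],
    "Tscore": [
        "earth<>station<>1 36.7029 1375 2249 2598",
        "signal<>unit<>2 30.1494 958 5446 1900",
    ],
}


def parse_line(line):
    """Split one output line into (bigram, rank, score)."""
    head, score, *_counts = line.split()
    *words, rank = head.split("<>")
    return "<>".join(words), int(rank), float(score)


def ranks_by_bigram(output):
    """Map each bigram to {measure name: rank under that measure}."""
    table = {}
    for measure, lines in output.items():
        for line in lines:
            bigram, rank, _score = parse_line(line)
            table.setdefault(bigram, {})[measure] = rank
    return table


if __name__ == "__main__":
    for bigram, ranks in ranks_by_bigram(OUTPUT).items():
        print(bigram, ranks)
```

Comparing ranks this way avoids comparing raw scores, which are on
different scales for each measure.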

So, unfortunately, there is usually a bit of trial and error involved.
Some of the measures' scores are sensitive to sample size, and so even
if you find a nice cutoff for one sample of data, you might not want
to use that for another sample of data (if it is larger or smaller).
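
To illustrate the sample size point, here is a small Python sketch
using the usual collocation t-score, (observed - expected) /
sqrt(observed). I am assuming this matches what the package computes,
and that the three counts in the quoted output are the joint count
followed by the two marginal counts. The total number of bigrams (NPP)
is not given in the message, so the value below is purely hypothetical.

```python
import math


def t_score(n11, n1p, np1, npp):
    """Collocation t-score: (observed - expected) / sqrt(observed).

    n11: joint count of the bigram
    n1p: count of bigrams with the first word in position one
    np1: count of bigrams with the second word in position two
    npp: total number of bigrams in the sample
    """
    expected = n1p * np1 / npp
    return (n11 - expected) / math.sqrt(n11)


# Counts for earth<>station taken from the quoted message.
N11, N1P, NP1 = 1375, 2249, 2598
# Hypothetical total; the real sample size is not stated in the message.
NPP = 400_000

if __name__ == "__main__":
    # Scale every count by k: the proportions stay identical,
    # but the score grows with the square root of k.
    for k in (1, 4, 16):
        score = t_score(k * N11, k * N1P, k * NP1, k * NPP)
        print(k, round(score, 2))
```

The same bigram with the same relative frequencies gets a larger score
in a larger sample, so a cutoff tuned on one corpus does not carry over
directly to a corpus of a different size.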

I wish I had clearer guidance to offer, but generally speaking I don't
think there are obvious answers to your question. (I would love to
learn I was wrong about this though, so if anyone has advice please do
come forward!)

Cordially,
Ted

On Wed, Apr 15, 2009 at 10:36 AM, mercevg <merc...@yahoo.es> wrote:
>
>
> Dear all,
>
> I would like to know how to select the best score for each n-gram. At the
> moment, I have my list of bigram counts scored by the statistical measures.
> Here are some examples:
>
> TMI
> earth<>station<>1 0.0205 1375 2249 2598
> signal<>unit<>5 0.0102 958 5446 1900
>
> Left
> earth<>station<>1 1.0000 1375 2249 2598
> signal<>unit<>1 1.0000 958 5446 1900
>
> Tscore
> earth<>station<>1 36.7029 1375 2249 2598
> signal<>unit<>2 30.1494 958 5446 1900
>
> How can I decide which of these three measures gives the best score for
> each bi-gram? Or, in this case, maybe I should consider just the rank value
> and not the score value to choose a collocation.
>
> Best regards,
> Mercè
>
> 

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
