Hi Merce,

Join the club. :) I've been thinking off and on for some time about
how to automatically identify these kinds of cutoffs, and I've never
reached a satisfactory conclusion.

What you realize is that some measures give very different scores
depending on the size of the corpus involved (ll and tmi are notable
examples of that), and even when they are somewhat stable, the numbers
themselves really have no inherent interpretation that makes it
obvious that 0.25 should indicate a collocation while 0.30 should not.
With pmi, for example, a score of 5 or 10 means (since pmi is a log
ratio) that the bigram occurs 2^5 or 2^10 times more often than
expected by chance. That sort of makes sense, but is 32 times more
often than chance enough to make it a collocation? Why not 64 times,
or 16? :)
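To make that pmi reading concrete, here is a minimal sketch (this is
not NSP's actual code, and the counts are invented for illustration)
of how pmi falls out of the bigram and marginal counts:

```python
import math

def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information (log base 2) of a bigram.

    n11 = bigram count, n1p = count of word1 in position 1,
    np1 = count of word2 in position 2, npp = total bigrams sampled.
    """
    expected = n1p * np1 / npp  # count expected if the words were independent
    return math.log2(n11 / expected)

# A bigram seen 100 times where independence predicts only 3.125:
print(pmi(100, 500, 1000, 160000))  # 5.0, i.e. 2**5 = 32 times expected
```

So the score itself tells you "how many doublings above chance," but
it still doesn't tell you where the collocation line is.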

I get myself twisted into knots, as you can tell.

Anyway, with p-values like you get in Fisher's test you at least have
a reliable or standard interpretation of what that p-value means - for
example, a p-value of 0.01 means, more or less, that if the bigram
you sampled is really independent (not a collocation) in the
underlying population (of language as a whole), then there is a 1%
chance that you would draw counts that make it look (wrongly) like it
is dependent (i.e. a collocation). But the more general point is
that despite this somewhat "rigorous" interpretation of the value,
is 0.01 really better than 0.05, and if so, why wouldn't 0.001 or even
0.0001 be better yet? It's very hard to pin down an exact value of
p that will serve as a cutoff like this.
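For what it's worth, a right-sided Fisher's exact p-value for a
bigram is just a hypergeometric tail sum over the 2x2 contingency
table, which you can sketch in a few lines of Python (the counts
below are toy numbers, not real data, and this is not NSP's own
implementation):

```python
from math import comb

def fisher_right_p(n11, n1p, np1, npp):
    """Right-sided Fisher's exact p-value for a bigram's 2x2 table.

    The chance of drawing n11 or more joint occurrences given the
    marginal counts, if the two words were really independent.
    """
    tail = sum(comb(np1, k) * comb(npp - np1, n1p - k)
               for k in range(n11, min(n1p, np1) + 1))
    return tail / comb(npp, n1p)

# Toy counts: bigram seen 8 times, each word seen 10 times, 100 bigrams total.
p = fisher_right_p(8, 10, 10, 100)
print(f"p = {p:.2e}")  # around 1e-8: very unlikely under independence
```

Of course, computing the p-value precisely still doesn't tell you
where to draw the line.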

In quite a lot of statistical literature, I think you see p-values
used in a somewhat more descriptive fashion, where results are
reported as "significant to a p-value of 0.0045", which then lets the
reader decide if that's "good enough" or not.

Finally, as I scan through output from statistic.pl, I generally can't
find a clear cutoff even when looking manually at a specific set of
data...the stuff that is at the top is usually pretty good
(collocations and what not) and the stuff at the bottom is very noisy,
but in the middle it tends to be somewhat interleaved.
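Given that interleaving, one pragmatic dodge (just a sketch with
made-up scores, not anything built into NSP) is to cut by rank rather
than by score - skim off the top N and accept that the middle is
murky:

```python
# Ranked output in the spirit of statistic.pl (ngram, rank, score);
# the scores here are invented for illustration.
ranked = [
    ("earth<>station<>", 1, 36.7029),
    ("signal<>unit<>", 2, 30.1494),
    ("more<>noise<>", 3, 1.2000),
]

top_n = 2  # an arbitrary rank cutoff, chosen by eyeballing the output
keep = [ngram for ngram, rank, score in ranked if rank <= top_n]
print(keep)  # ['earth<>station<>', 'signal<>unit<>']
```

Choosing top_n is of course just as arbitrary as choosing a score
cutoff, but at least it transfers between samples of different sizes.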

I am sure you have realized all of this already. Just thought I'd add
a few more thoughts, and also encourage you or anyone else who has
good ideas on this to please share them as they occur to you. :)

Cordially,
Ted


On Mon, Apr 20, 2009 at 1:13 PM, mercevg <merc...@yahoo.es> wrote:
>
>
> Dear Ted,
>
> Thank you very much for your answer. I know that my question is not easy to
> answer. I have been analysing the differences between scores and measures
> for months, but it's so difficult to establish a parameter or pattern to choose
> the best measure and score.
>
> At the moment, the Left measure is the best for ranking bi-grams, as
> you said in the FAQ document.
>
> Well, I'll continue thinking about it!
>
> Best regards,
> Mercè
>
> --- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>>
>> Greetings Merce,
>>
>> Our FAQ tries to provide a little guidance on this issue...
>>
>> http://search.cpan.org/dist/Text-NSP/doc/FAQ.pod
>>
>> The short answer though is that there probably isn't a single measure
>> that is always the "best" choice. Worse yet, in general there are not
>> any clear "cutoffs" for any of the measures as to where you find a
>> boundary between meaningful associations and spurious ones. Even when
>> using p-scores (in Fisher's Exact test) you can set cutoffs of .01 .05
>> .1 .001 .005 and so on with equal validity....
>>
>> So, unfortunately, there is usually a bit of trial and error involved.
>> Some of the measure's scores are sensitive to sample size, and so even
>> if you find a nice cutoff for one sample of data, you might not want
>> to use that for another sample of data (if it is larger or smaller).
>>
>> I wish I had clearer guidance to offer, but generally speaking I don't
>> think there are obvious answers to your question. (I would love to
>> learn I was wrong about this though, so if anyone has advice please do
>> come forward!)
>>
>> Cordially,
>> Ted
>>
>> On Wed, Apr 15, 2009 at 10:36 AM, mercevg <merc...@...> wrote:
>> >
>> >
>> > Dear all,
>> >
>> > I would like to know how to select the best score for each n-gram. At
>> > the
>> > moment, I have my count bi-grams list filtered by the statistical
>> > measures.
>> > I give you some examples:
>> >
>> > TMI
>> > earth<>station<>1 0.0205 1375 2249 2598
>> > signal<>unit<>5 0.0102 958 5446 1900
>> >
>> > Left
>> > earth<>station<>1 1.0000 1375 2249 2598
>> > signal<>unit<>1 1.0000 958 5446 1900
>> >
>> > Tscore
>> > earth<>station<>1 36.7029 1375 2249 2598
>> > signal<>unit<>2 30.1494 958 5446 1900
>> >
>> > How can I distinguish the best score between these three measures for
>> > each
>> > bi-gram? Or, in this case, maybe I should consider just the rank
>> > value and not the score value to choose a collocation.
>> >
>> > Best regards,
>> > Mercè
>> >
>> >
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
