--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:

Dear Ted,

Thanks a lot for your comments!

First, I'll try to compare results with different corpus sizes, if it's
possible to get a standard way of combining scores, frequencies and measures
for each candidate. Taking into account your comments, though, maybe the
results won't be clear at all.
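
For example, one very simple way I'm considering to combine them (just a
sketch of my own, not NSP code, and the file names below are made up) is to
parse the statistic.pl output for each measure (lines of the form
word1<>word2<>rank score n11 n1p np1) and average the rank that each measure
assigns to a bigram:

  # Sketch: average the ranks that several statistic.pl runs assign
  # to the same bigrams.  Output lines look like:
  #   word1<>word2<>rank score n11 n1p np1
  from collections import defaultdict

  def read_ranks(path):
      ranks = {}
      with open(path) as handle:
          for line in handle:
              parts = line.strip().split('<>')
              if len(parts) != 3:
                  continue                     # skip any header/summary lines
              rank = int(parts[2].split()[0])  # first field after the two words
              ranks[(parts[0], parts[1])] = rank
      return ranks

  measure_files = ['out.tmi', 'out.left', 'out.tscore']   # made-up names
  combined = defaultdict(list)
  for path in measure_files:
      for bigram, rank in read_ranks(path).items():
          combined[bigram].append(rank)

  # bigrams that every measure puts near the top come out first
  for bigram, ranks in sorted(combined.items(),
                              key=lambda kv: sum(kv[1]) / len(kv[1])):
      print(bigram, ranks)

Of course, this only compares ranks, not the scores themselves, so it
sidesteps the cutoff problem rather than solving it.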

Also, I've read your article "Fishing for Exactness", and I've noticed that
you consider that a higher rank (a .0000 score) implies that there is no
evidence of independence (i.e., a dependent bigram), while a lower rank
implies an independent bigram.

In my experiments, using a gold standard, I've noticed that bigrams with lower
ranks are specialized bigrams in the corpus, while bigrams with higher ranks
are less common bigrams in the specialized corpus. In this case, maybe a
higher rank implies more dependency between general words, and a lower rank
implies more independence between the two words, but it also implies that the
bigram is a specialized candidate in the corpus.

Best wishes,
Mercè


>
> Hi Merce,
> 
> Join the club. :) I've been thinking about the issue of how to
> automatically identify these kinds of cutoffs off and on for some
> time, and I've never reached a satisfactory conclusion.
> 
> What you realize is that some measures give very different scores
> depending on the size of the corpus involved (ll and tmi are notable
> examples of that), and even when they are somewhat stable, the numbers
> themselves really have no inherent interpretation that makes it
> obvious that 0.25 should indicate a collocation while 0.30 should not.
> With pmi, for example, sometimes I think of scores like 5 or 10
> representing something like "the bigram occurs 5 or 10 times more
> often than expected by chance." That sort of makes sense, but is 5
> times more often than chance enough to make it a collocation? Why not
> 6 times or 4 times? :)
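> 
> As a back-of-the-envelope illustration (the counts below are invented, not
> taken from any particular corpus), pmi is just the log of the ratio between
> the observed bigram count and the count you'd expect if the two words were
> independent, so how a score maps onto "times more often than chance" depends
> on the base of the log:
> 
>   # Invented counts, using the count.pl conventions:
>   #   n11 = bigram count, n1p = count of word1 in the first position,
>   #   np1 = count of word2 in the second position, npp = total bigrams
>   from math import log2
> 
>   def pmi(n11, n1p, np1, npp):
>       expected = n1p * np1 / npp   # count expected if the words were independent
>       return log2(n11 / expected)  # base 2 here; another base just rescales the score
> 
>   print(pmi(n11=40, n1p=200, np1=400, npp=20000))  # ~3.32, i.e. 2**3.32 ~ 10x chance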
> 
> I get myself twisted into knots, as you can tell.
> 
> Anyway, with p-values like you get in Fisher's test you at least have
> a reliable or standard interpretation of what that p-value means - for
> example, a p-value of 0.01 will mean, more or less, that if the bigram
> you sampled is really independent (not a collocation) in the
> underlying population (of language as a whole), then there is a 1%
> chance that you would draw counts that make it look (wrongly) like it
> is dependent (ie a collocation). But, the more general point to make
> is that despite this somewhat "rigorous" interpretation of the value,
> is 0.01 really better than 0.05, and if so why wouldn't 0.001 or even
> 0.0001 be better yet? It's very hard to pin down an exact value for
> p (that will serve as a cutoff like this).
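> 
> Just to make that concrete, here is roughly how such a p-value comes out of
> the 2x2 contingency table for a bigram (invented counts again, and scipy's
> fisher_exact is only standing in for NSP's Fisher measures):
> 
>   # Invented counts: n11 = joint count, n1p and np1 = marginal counts,
>   # npp = total number of bigrams in the sample
>   from scipy.stats import fisher_exact
> 
>   n11, n1p, np1, npp = 40, 200, 400, 20000
>   table = [[n11,       n1p - n11],
>            [np1 - n11, npp - n1p - np1 + n11]]
> 
>   # 'greater' asks: if the two words really were independent, how often
>   # would a joint count at least this large turn up just by chance?
>   odds_ratio, p_value = fisher_exact(table, alternative='greater')
>   print(p_value)  # a small value (say 0.01) means counts like these are rare by chance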
> 
> In quite a lot of statistical literature, I think you see p-values
> used in a somewhat more descriptive fashion, where results are
> reported as "significant to a p-value of 0.0045", which then lets the
> reader decide if that's "good enough" or not.
> 
> Finally, as I scan through output from statistic.pl, I generally can't
> find a clear cutoff even when looking manually at a specific set of
> data...the stuff that is at the top is usually pretty good
> (collocations and what not) and the stuff at the bottom is very noisy,
> but in the middle it tends to be somewhat interleaved.
> 
> I am sure you have realized all of this already. Just thought I'd add
> a few more thoughts, and also encourage you or anyone else who has
> good ideas on this to please share them as they occur to you. :)
> 
> Cordially,
> Ted
> 
> 
> On Mon, Apr 20, 2009 at 1:13 PM, mercevg <merc...@...> wrote:
> >
> >
> > Dear Ted,
> >
> > Thank you very much for your answer. I know that my question is not easy to
> > answer. I have been analysing the differences between scores and measures
> > for months, but it's very difficult to establish a parameter or pattern to
> > choose the best measure and score.
> >
> > At the moment, the Left measure is the best one for ranking bigrams, as you
> > said in the FAQ document.
> >
> > Well, I'll continue thinking about it!
> >
> > Best regards,
> > Mercè
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <duluthted@> wrote:
> >>
> >> Greetings Merce,
> >>
> >> Our FAQ tries to provide a little guidance on this issue...
> >>
> >> http://search.cpan.org/dist/Text-NSP/doc/FAQ.pod
> >>
> >> The short answer though is that there probably isn't a single measure
> >> that is always the "best" choice. Worse yet, in general there are not
> >> any clear "cutoffs" for any of the measures as to where you find a
> >> boundary between meaningful associations and spurious ones. Even when
> >> using p-scores (in Fisher's Exact test) you can set cutoffs of .01 .05
> >> .1 .001 .005 and so on with equal validity....
> >>
> >> So, unfortunately, there is usually a bit of trial and error involved.
> >> Some of the measures' scores are sensitive to sample size, and so even
> >> if you find a nice cutoff for one sample of data, you might not want
> >> to use that for another sample of data (if it is larger or smaller).
> >>
> >> I wish I had clearer guidance to offer, but generally speaking I don't
> >> think there are obvious answers to your question. (I would love to
> >> learn I was wrong about this though, so if anyone has advice please do
> >> come forward!)
> >>
> >> Cordially,
> >> Ted
> >>
> >> On Wed, Apr 15, 2009 at 10:36 AM, mercevg <mercevg@> wrote:
> >> >
> >> >
> >> > Dear all,
> >> >
> >> > I would like to know how to select the best score for each n-gram. At the
> >> > moment, I have my bigram counts list filtered by the statistical measures.
> >> > Here are some examples:
> >> >
> >> > TMI
> >> > earth<>station<>1 0.0205 1375 2249 2598
> >> > signal<>unit<>5 0.0102 958 5446 1900
> >> >
> >> > Left
> >> > earth<>station<>1 1.0000 1375 2249 2598
> >> > signal<>unit<>1 1.0000 958 5446 1900
> >> >
> >> > Tscore
> >> > earth<>station<>1 36.7029 1375 2249 2598
> >> > signal<>unit<>2 30.1494 958 5446 1900
> >> >
> >> > How can I distinguish the best score among these three measures for each
> >> > bigram? Or, in this case, maybe I have to consider just the rank value and
> >> > not the score value to choose a collocation.
> >> >
> >> > Best regards,
> >> > Mercè
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Ted Pedersen
> >> http://www.d.umn.edu/~tpederse
> >>
> >
> > 
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

