My apologies for being a bit slow in following up on this. But I
think for identifying significant or interesting bigrams with Fisher's
exact test, a left-sided test makes the most sense. The left-sided
test gives us the probability that the pair of words would occur
together no more often than we observed if we repeated our experiment
on another sample of text. If the left-sided probability is high, it
means our current observation is much more frequent than we'd expect
based on pure chance alone, and so the pair of words we have observed
is likely to be significant or interesting.
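To make this interpretation concrete, here is a quick check with Python's
SciPy, using the "and industry" counts that come up later in this thread
(n11 = 22, n1+ = 30707, n+1 = 952, n++ = 1382828):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for the bigram "and industry":
# n11 = 22, n12 = n1+ - n11, n21 = n+1 - n11, n22 = n++ - n1+ - n+1 + n11
n11, n1p, np1, npp = 22, 30707, 952, 1382828
table = [[n11, n1p - n11], [np1 - n11, npp - n1p - np1 + n11]]

# Left-sided test: probability of seeing n11 or fewer joint occurrences
# under the null hypothesis that the two words are independent.
_, p_left = fisher_exact(table, alternative="less")
print(round(p_left, 4))  # 0.6297
```

A left-sided p-value near 1 would indicate a count well above what chance
predicts; here 0.63 says 22 is only slightly above the expected 21.14.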

I hope this makes some sense, but please feel free to follow up if it
doesn't or if you think I may be misinterpreting something here.

Cordially,
Ted

---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 6:28 PM Ted Pedersen <tpede...@d.umn.edu> wrote:
>
> Thanks for these questions - all of the details are quite helpful. And
> yes, I think your method for computing n12 and n22 is just fine.
>
> As a historical note, it's worth pointing out the Fishing for
> Exactness paper pre-dates Text-NSP by a number of years. This paper
> was published in 1996, and Text-NSP began in about 2002 and was actively
> developed for several years thereafter. That said, when implementing
> Text-NSP we were certainly basing it off of this earlier work and so
> I'd hope the results from Text-NSP would be consistent with the paper.
> To that end I ran the example you gave on Text-NSP and show the
> results below. What you see is consistent with what you ran in python,
> and so it seems pretty clear that the results from the paper are
> indeed the two tailed test (contrary to what the paper says).
>
> cat x.cnt
> 1382828
> and<>industry<>22 30707 952
>
> statistic.pl leftFisher x.left x.cnt
>
> cat x.left
> 1382828
> and<>industry<>1 0.6297 22 30707 952
>
> statistic.pl rightFisher x.right x.cnt
>
> cat x.right
> 1382828
> and<>industry<>1 0.4546 22 30707 952
>
> statistic.pl twotailed x.two x.cnt
>
> cat x.two
> 1382828
> and<>industry<>1 0.8253 22 30707 952
>
> As to your more general question of what should be done, I will need
> to refresh my recollection of this, although in general the
> interpretation of left-, right- and two-sided tests depends on your null
> hypothesis. In our case, and for finding "dependent" bigrams in
> general, the null hypothesis is that the two words are independent,
> and so we are seeking evidence to either confirm or deny that
> hypothesis. The left-sided test (for Fisher's exact) is giving us the
> p-value of n11 <= 22, i.e. the probability of seeing 22 or fewer joint
> occurrences under independence. How to interpret that is where I need
> to refresh my recollection, but that is the general direction things
> are heading.
>
> I think a one sided test makes more sense for identifying dependent
> bigrams, since in general if you have more occurrences than you expect
> by chance, at some point beyond that expected value you are going to
> decide it's not a chance occurrence. There is no value above the
> expected value where you are going to say (I don't think) oh no, these
> two words are no longer dependent on each other (i.e., they are occurring
> too frequently to be dependent). I think a two tailed test makes the
> most sense if there is a point both above and below the expected value
> where your null hypothesis is potentially rejected.
>
> In the case of "and industry" where the expected value is 21.14, it
> seems very hard to argue that 22 occurrences is enough to say that
> they are dependent. But, this is where I'm just a little foggy right
> now. I'll look at this a little more and reply a bit more precisely.
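For reference, the 21.14 mentioned above is just the expected joint
frequency under independence, computed from the marginals:

```python
# m11 = n1+ * n+1 / n++ for "and industry"
n1p, np1, npp = 30707, 952, 1382828
m11 = n1p * np1 / npp
print(round(m11, 2))  # 21.14
```

With an observed n11 of 22 sitting barely above 21.14, it is no surprise
that none of the Fisher p-values approach significance.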
>
> I'm not sure about the keyword extraction case, but if you have an
> example I'd be happy to think a little further about that as well!
>
> More soon,
> Ted
> ---
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
> On Sun, Nov 25, 2018 at 11:32 AM BLK
> Serene <blkser...@gmail.com> wrote:
> >
> > Thanks for the clarification!
> >
> > And I have some other questions about your paper "Fishing for Exactness":
> >
> > 1. The paper says that "In the test for association to determine bigram 
> > dependence Fisher's exact test is interpreted as a left-sided test."
> > And in last part "Experiment: Test for Association", it also says that "In 
> > this experiment, we compare the significance values computed using the 
> > t-test, the x2 approximation to the distribution of both G2 and X2 and 
> > Fisher's exact test (left sided)".
> > But as for the examples given in "Figure 8: test for association: <word> 
> > industry":
> > E.g. for word "and", the given data is:
> >     n++ (total number of tokens in the corpus): 1382828 (taken from "Figure 
> > 3")
> >     n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
> >
> >     n11 = 22
> >     n21 = 952 - 22 = 930
> >
> > Since n12 is not given in the table, I have to recover n1+ from the
> > expected value:
> >     m11 = n1+ * n+1 / n++
> >     so n1+ = m11 * n++ / n+1 = 21.14 * 1382828 / 952 = 30706.915882352943
> > (approximately 30707)
> >
> > And then:
> >     n12 = 30707 - 22 = 30685
> >     n22 = 1382828 - 952 - 30707 + 22 = 1351191
> >
> > I'm not sure if my calculation is correct, but when using n11 = 22, n12 = 
> > 30685, n21 = 930, n22 = 1351191 as the input, the left-sided fisher's exact 
> > test gives the result 0.6296644386744733, which does not match the 0.8255 
> > given in the example. I use Python's SciPy module to calculate this:
> >
> > >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative = 
> > >>> 'less') # the parameter "alternative" specifies the left-sided test be 
> > >>> used
> > (1.041670459980972, 0.6296644386744733) # The first value is Odds Ratio 
> > (irrelevant), the second is the p-value given by Fisher's exact test
> >
> > Then I tried the two-tailed test, which gave the expected value 
> > (approximately):
> >
> > >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative = 
> > >>> 'two-sided') # Two-sided test
> > (1.041670459980972, 0.8253462481347)
> >
> > So I suppose that the result given in the figure was actually calculated 
> > using the two-sided Fisher's exact test (is it a mistake, or should the 
> > two-sided test be used instead?)
> >
> > 2. I've noticed that the left-sided, right-sided and two-sided Fisher's 
> > exact tests are all implemented in NSP, so which one is preferred in the 
> > general case? (Or does it have to be determined by the purpose of the 
> > research?) Since I'm writing a corpus tool to be used by myself and other 
> > researchers, implementing too many similar significance tests would 
> > confuse those who know little about math or statistics.
> >
> > 3. The paper mainly discusses the context of collocation identification 
> > (two words in the same corpus), but it is cited in "Embracing Bayes Factors 
> > for Key Item Analysis in Corpus Linguistics" (Wilson, 3), which talks 
> > about measures used in keyword extraction (the same word in two different 
> > corpora). So I'm wondering whether it is suitable to use Fisher's exact 
> > test in the context of both collocation identification and keyword extraction.
> >
> > Sorry for so many questions, thanks in advance.
> >
> > On Sun, Nov 25, 2018 at 10:54 PM Ted Pedersen <tpede...@d.umn.edu> wrote:
> >>
> >> Hi Blk,
> >>
> >> Thanks for pointing these out. On the Poisson Stirling measure, I
> >> think the reason we haven't included log n is that log n would simply
> >> be a constant (log of the total number of bigrams) and so would not
> >> change the rankings that we get from these scores. That said, if you
> >> were comparing scores across different sized corpora then the
> >> denominator would likely be important to include.
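To illustrate the point about the constant denominator, here is a small
sketch (the counts other than "and industry" are made up for illustration):

```python
import math

def poisson_stirling(n11, m11, n_total=None):
    # Text-NSP form: n11 * (log(n11) - log(m11) - 1)
    score = n11 * (math.log(n11) - math.log(m11) - 1)
    if n_total is not None:
        # Quasthoff's form divides by log n, which is a constant
        # for a fixed corpus and so cannot change the ranking.
        score /= math.log(n_total)
    return score

# Hypothetical (n11, m11) pairs for three bigrams in one corpus
bigrams = {"and industry": (22, 21.14), "b": (100, 10.0), "c": (50, 5.0)}

rank_without = sorted(bigrams, key=lambda b: poisson_stirling(*bigrams[b]))
rank_with = sorted(bigrams,
                   key=lambda b: poisson_stirling(*bigrams[b], 1382828))
print(rank_without == rank_with)  # True: same ranking either way
```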
> >>
> >> Thanks for pointing out the typos. Text-NSP is right now in a fairly
> >> dormant state, but I do have a list of small changes to make and will
> >> add yours to these.
> >>
> >> Thanks for your interest, and please let us know if you have any other
> >> questions.
> >>
> >> Cordially,
> >> Ted
> >> ---
> >> Ted Pedersen
> >> http://www.d.umn.edu/~tpederse
> >>
> >> On Sun, Nov 25, 2018 at 4:13 AM BLK Serene <blkser...@gmail.com> wrote:
> >> >
> >> > Hi, I have some questions about the association measures implemented in 
> >> > Text-NSP:
> >> >
> >> > The Poisson-Stirling Measure given in the documentation is:
> >> > Poisson-Stirling = n11 * ( log(n11) - log(m11) - 1)
> >> >
> >> > But in Quasthoff's paper the formula given by the author is:
> >> > sig(A, B) = (k * (log k - log λ - 1)) / log n
> >> >
> >> > I'm a little confused since I know little about math or statistics. Why 
> >> > is the denominator omitted here?
> >> >
> >> > And some typos in the doc:
> >> > square of phi coefficient:
> >> > PHI^2 = ((n11 * n22) - (n21 * n21))^2/(n1p * np1 * np2 * n2p)
> >> > where n21 *n21 should be n12 * n21
> >> >
> >> > chi-squared test:
> >> > Pearson's Chi-squred test measures the devitation (should be deviation) 
> >> > between
> >> >
> >> > Pearson's Chi-Squared = 2 * [((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
> >> >                              ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2]
> >> > should be: ((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
> >> >                    ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2
> >> >
> >> > And chi2: same as above.
> >> >
> >> > Thanks in advance.
