Thanks for these questions - all of the details are quite helpful. And
yes, I think your method for computing n12 and n22 is just fine.

As a historical note, it's worth pointing out the Fishing for
Exactness paper pre-dates Text-NSP by a number of years. The paper
was published in 1996, and Text-NSP began in about 2002 and was actively
developed for several years thereafter. That said, when implementing
Text-NSP we were certainly basing it off of this earlier work and so
I'd hope the results from Text-NSP would be consistent with the paper.
To that end I ran the example you gave on Text-NSP and show the
results below. What you see is consistent with what you ran in python,
and so it seems pretty clear that the results from the paper are
indeed the two tailed test (contrary to what the paper says).

cat x.cnt
1382828
and<>industry<>22 30707 952

statistic.pl leftFisher x.left x.cnt

cat x.left
1382828
and<>industry<>1 0.6297 22 30707 952

statistic.pl rightFisher x.right x.cnt

cat x.right
1382828
and<>industry<>1 0.4546 22 30707 952

statistic.pl twotailed x.two x.cnt

cat x.two
1382828
and<>industry<>1 0.8253 22 30707 952
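For completeness, the same three tails can be reproduced directly from
the contingency table with scipy (a quick sketch using the counts from
your example, with the same fisher_exact call you used):

```python
from scipy.stats import fisher_exact

# Contingency table for "and industry": [[n11, n12], [n21, n22]]
table = [[22, 30685], [930, 1351191]]

_, p_left = fisher_exact(table, alternative='less')      # leftFisher
_, p_right = fisher_exact(table, alternative='greater')  # rightFisher
_, p_two = fisher_exact(table, alternative='two-sided')  # twotailed

print(round(p_left, 4), round(p_right, 4), round(p_two, 4))
```

The three rounded values should line up with the leftFisher,
rightFisher, and twotailed outputs above.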

As to your more general question of what should be done, I will need
to refresh my recollection of this, although in general the
interpretation of left-, right-, and two-sided tests depends on your null
hypothesis. In our case, and for finding "dependent" bigrams in
general, the null hypothesis is that the two words are independent,
and so we are seeking evidence to either confirm or deny that
hypothesis. The left-sided test (for Fisher's exact) gives us the
p-value of n11 <= 22, that is, the probability of observing 22 or fewer
co-occurrences under independence. How to interpret that is where I
need to refresh my recollection, but that is the general direction
things are heading.
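In case it's useful, that left-sided value is just the hypergeometric
CDF at n11 given the marginals (a sketch with scipy, using the counts
from the example above):

```python
from scipy.stats import hypergeom

# Marginals from the example: total bigrams, "and *" count, "* industry" count
npp, n1p, np1, n11 = 1382828, 30707, 952, 22

# Left-sided Fisher's exact p-value: P(X <= n11) under independence
p_left = hypergeom.cdf(n11, npp, np1, n1p)
print(round(p_left, 4))
```

This should agree with the leftFisher output above.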

I think a one sided test makes more sense for identifying dependent
bigrams, since in general if you have more occurrences than you expect
by chance, at some point beyond that expected value you are going to
decide it's not a chance occurrence. There is no value above the
expected value where you are going to say (I don't think) that these
two words are no longer dependent on each other (i.e., that they are
occurring too frequently to be dependent). I think a two-tailed test
makes the
most sense if there is a point both above and below the expected value
where your null hypothesis is potentially rejected.
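To illustrate that point with the marginals from this example (a quick
sketch of my own, not anything from the paper): the right-sided p-value
P(n11 >= k) only shrinks as k grows, so there is a single cutoff above
which you'd reject independence, with nothing beyond it that flips back:

```python
from scipy.stats import hypergeom

npp, n1p, np1 = 1382828, 30707, 952   # marginals from the example

# Right-sided p-value P(X >= k) for counts at and above the expected 21.14;
# find the smallest count that would be significant at the 0.05 level
for k in range(22, np1 + 1):
    p_right = hypergeom.sf(k - 1, npp, np1, n1p)
    if p_right < 0.05:
        print("smallest significant n11:", k)
        break
```

Everything at or above that count gets rejected, everything below it
does not, which is what makes the one-sided interpretation so natural
for dependent bigrams.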

In the case of "and industry" where the expected value is 21.14, it
seems very hard to argue that 22 occurrences is enough to say that
they are dependent. But, this is where I'm just a little foggy right
now. I'll look at this a little more and reply a bit more precisely.
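For concreteness, that expected value is just m11 = (n1+ * n+1) / n++,
computed from the marginals in your example:

```python
# Expected count of "and industry" if the two words were independent
n1p, np1, npp = 30707, 952, 1382828   # marginals from the example
m11 = n1p * np1 / npp
print(round(m11, 2))  # ≈ 21.14
```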

I'm not sure about the keyword extraction case, but if you have an
example I'd be happy to think a little further about that as well!

More soon,
Ted
---
Ted Pedersen
http://www.d.umn.edu/~tpederse

On Sun, Nov 25, 2018 at 11:32 AM BLK Serene <blkser...@gmail.com> wrote:
>
> Thanks for the clarification!
>
> And I have some other question about your paper "Fishing for Exactness"
>
> 1. The paper says that "In the test for association to determine bigram 
> dependence Fisher's exact test is interpreted as a left-sided test."
> And in last part "Experiment: Test for Association", it also says that "In 
> this experiment, we compare the significance values computed using the 
> t-test, the x2 approximation to the distribution of both G2 and X2 and 
> Fisher's exact test (left sided)".
> But as for the examples given in "Figure 8: test for association: <word> 
> industry":
> E.g. for word "and", the given data is:
>     n++ (total number of tokens in the corpus): 1382828 (taken from "Figure 
> 3")
>     n+1 (total frequency of "industry"): 952 (taken from "Figure 3")
>
>     n11 = 22
>     n21 = 952 - 22 = 930
>
> Since n12 is not given in the table, I have to compute it by
>     m11 = n1+ * n+1 / n++
>     so n1+ is 21.14 * 1382828 / 952 = 30706.915882352943 (approximately 30707)
>
> And then:
>     n12 = 30707 - 22 = 30685
>     n22 = 1382828 - 952 - 30707 + 22 = 1351191
>
> I'm not sure if my calculation is correct, but when using n11 = 22, n12 = 
> 30685, n21 = 930, n22 = 1351191 as the input, the left-sided fisher's exact 
> test gives the result 0.6296644386744733 which is not matched with 0.8255 
> given in the example. I use Python's Scipy module to calculate this:
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative = 
> >>> 'less') # the parameter "alternative" specifies the left-sided test be 
> >>> used
> (1.041670459980972, 0.6296644386744733) # The first value is Odds Ratio 
> (irrelevant), the second is the p-value given by Fisher's exact test
>
> Then I tried the two-tailed test, which gave the expected value 
> (approximately):
>
> >>> scipy.stats.fisher_exact([[22, 30685], [930, 1351191]], alternative = 
> >>> 'two-sided') # Two-sided test
> (1.041670459980972, 0.8253462481347)
>
> So I suppose that the results given in the figure are actually calculated 
> using the two-sided Fisher's exact test (is it a mistake, or should the 
> two-sided test be used instead?)
>
> 2. I've noticed the left-sided, right-sided, and two-sided Fisher's exact test 
> are all implemented in NSP, so which one is preferred in the general case? (Or 
> does it have to be determined by the purpose of the research?) Since I'm 
> writing a corpus tool to be used by myself and other researchers, implementing 
> too many similar significance tests would confuse those who know little about 
> math or statistics.
>
> 3. The paper discusses mainly the context of collocation identification 
> (two words in the same corpus), but it is cited in "Embracing Bayes Factors 
> for Key Item Analysis in Corpus Linguistics" (Wilson, 3), which talks about 
> measures used in keyword extraction (the same word in two different corpora). 
> So I'm wondering whether it is suitable to use Fisher's exact test in the 
> context of both collocation identification and keyword extraction.
>
> Sorry for so many questions, thanks in advance.
>
> On Sun, Nov 25, 2018 at 10:54 PM Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>> Hi Blk,
>>
>> Thanks for pointing these out. On the Poisson Stirling measure, I
>> think the reason we haven't included log n is that log n would simply
>> be a constant (log of the total number of bigrams) and so would not
>> change the rankings that we get from these scores. That said, if you
>> were comparing scores across different sized corpora then the
>> denominator would likely be important to include.
>>
>> Thanks for pointing out the typos. Text-NSP is right now in a fairly
>> dormant state, but I do have a list of small changes to make and will
>> add yours to these.
>>
>> Thanks for your interest, and please let us know if you have any other
>> questions.
>>
>> Cordially,
>> Ted
>> ---
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>> On Sun, Nov 25, 2018 at 4:13 AM BLK Serene <blkser...@gmail.com> wrote:
>> >
>> > Hi, I have some questions about the association measures implemented in 
>> > Text-NSP:
>> >
>> > The Poisson-Stirling Measure given in the documentation is:
>> > Poisson-Stirling = n11 * ( log(n11) - log(m11) - 1)
>> >
>> > But in Quasthoff's paper the formulae given by the author is:
>> > sig(A, B) = (k * (log k - log λ - 1)) / log n
>> >
>> > I'm a little confused since I know little about math or statistics. Why is 
>> > the denominator omitted here?
>> >
>> > And some typos in the doc:
>> > square of phi coefficient:
>> > PHI^2 = ((n11 * n22) - (n21 * n21))^2/(n1p * np1 * np2 * n2p)
>> > where n21 *n21 should be n12 * n21
>> >
>> > chi-squared test:
>> > Pearson's Chi-squred test measures the devitation (should be deviation) 
>> > between
>> >
>> > Pearson's Chi-Squared = 2 * [((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>> >                              ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2]
>> > should be: ((n11 - m11)/m11)^2 + ((n12 - m12)/m12)^2 +
>> >                    ((n21 - m21)/m21)^2 + ((n22 -m22)/m22)^2
>> >
>> > And chi2: same as above.
>> >
>> > Thanks in advance.
