Hi Kate,

> We are trying to use nsp to find dependent bigrams in a corpus of
> annotated tutoring dialogues, and we have a few (probably pretty basic)
> questions:

Wow! This is really really neat. I'm so amazed that someone thought to do
this. Really. I am so used to counting up collocations that I have a hard
time thinking of using nsp for anything else. But I'm glad other people
have more vision than I do. :)

> The tutor turns in our dialogues are annotated with "Tutor
> Moves", and the student turns are annotated with "Student Moves".
> At an abstract level, the dialogues look like this:
>
> Tutor Move 0
> Student Move 1
> Tutor Move 2
> Student Move 3
> ...
>
> We've used nsp to extract all the "Move" bigrams in this data, e.g.:
>
> Tutor Move 0<>Student Move 1
> Student Move 1<>Tutor Move 2
> ...

This is very clever.

>
> However, we are *only* investigating whether any of the
> Student Move<>Tutor Move bigrams are dependent. In other words, we're
> asking the question:
>
>       Of all the possible Student Move<>Tutor Move bigrams, are there
>       any that occur more frequently than predicted by chance?

Got it! Put another way: is there evidence of an association of some sort
between the student move and the tutor move? BTW, note that the NSP tests
themselves do not assume any "directionality". In other words, it is not
possible to say that student move causes tutor move, or something like
that, at least not on the basis of the NSP measures. The way you have
written the above suggests to me that you understand this already, but
just thought I'd throw that in.

> For example, suppose our Student Moves can be: "correct answer" or
> "incorrect answer" and our Tutor Moves can be: "positive feedback" or
> "negative feedback". Then the bigrams we are interested in are:
>
> correct answer<>positive feedback
> correct answer<>negative feedback
> incorrect answer<>positive feedback
> incorrect answer<>negative feedback

OK, this makes sense.

>
> However, count.pl gives us these bigrams and those shown below, where
> the Tutor Move is the first element in the bigram:
>
> positive feedback<>correct answer
> negative feedback<>correct answer
> positive feedback<>incorrect answer
> negative feedback<>incorrect answer
>

Right, NSP will just chop your data up into overlapping two-word
sequences, so if you have

w1 w2 w3 w4

you will get bigrams

w1 w2
w2 w3
w3 w4
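
In case it helps to see that as code, here is a tiny Python sketch of the
same sliding-window idea (purely an illustration of the concept, not how
count.pl is actually implemented):

tokens = ["w1", "w2", "w3", "w4"]

# Pair each token with its successor: (w1,w2), (w2,w3), (w3,w4)
for first, second in zip(tokens, tokens[1:]):
    print(first + "<>" + second)

This prints w1<>w2, w2<>w3, w3<>w4, just like the list above.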

>
> Question 1: is it ok that count.pl totals up all the bigrams, or will
> this skew the statistical results? In other words, should we only be
> totaling up bigrams where a Student Move is the first element?

Interesting question. The sample size is affected by the number of bigrams
you count, and indeed that has a big effect on the results of the tests.

If you are saying that

positive feedback<>correct answer

is an "invalid" combination, then I think you probably want to throw and
their like out, because these will be added to your sample size.

(By invalid combination I mean this combination of events in this order
will never and can never occur.)

In the context of bigrams, we are always thinking of representing two
random variables

WORD1 and WORD2

where WORD1 is presumed to be the first word in the bigram, and WORD2 is
the second word. Now, given the nature of text, a word that is the second
word in one bigram is also going to be the first word of the next bigram,
unless it is at the end of a file OR at the end of a line. And I think this is where
count.pl can come to your aid.

Now, in your case, each of your bigrams should be thought of as being its
own sentence. In other words, let us suppose your input file to count.pl
looks like this (let's call it tutor.dat):

correct_answer  positive_feedback
correct_answer  negative_feedback
incorrect_answer positive_feedback
incorrect_answer negative_feedback

Now, if you do

count.pl tutor.out tutor.dat

This is tutor.out...

7
negative_feedback<>incorrect_answer<>1 1 2
positive_feedback<>incorrect_answer<>1 2 2
incorrect_answer<>positive_feedback<>1 2 2
positive_feedback<>correct_answer<>1 2 1
correct_answer<>negative_feedback<>1 2 2
correct_answer<>positive_feedback<>1 2 2
incorrect_answer<>negative_feedback<>1 2 2

...which doesn't make sense, because you want your first column to represent 
the random variable STUDENT-MOVE and the second column to represent the random 
variable TUTOR-MOVE. So you have ill-defined random variables, your sample size
is too large, it's all just chaos really :)

Enter the --newLine option.

count.pl --newLine tutor1.out tutor.dat

This is tutor1.out...

4
incorrect_answer<>positive_feedback<>1 2 2
correct_answer<>negative_feedback<>1 2 2
correct_answer<>positive_feedback<>1 2 2
incorrect_answer<>negative_feedback<>1 2 2

This looks good. You have two well-defined random variables, one
represented in each column (STUDENT-MOVE and TUTOR-MOVE), and your sample
size is correct.

So, you are right to be suspicious of counting up everything together.
Fortunately --newLine makes it easy to avoid that.

Let me just make sure that I'm understanding things right by
re-representing the following:

4
incorrect_answer<>positive_feedback<>1 2 2

this means...

            positive  !positive
incorrect       1*        1      |    2*
!incorrect      1         1      |    2
               -------------------
                2*        2           4*

In other words, reading across the first row there is one combination where the
incorrect answer was followed by positive feedback (1*), one combination where
the incorrect answer was followed by non-positive feedback, and 2 combinations
where incorrect answer was the student response (2*). Hopefully that accurately
represents your situation.

[The starred values are coming from the count.pl output, and the
rest can be worked out from that.]
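
If you ever want to check that arithmetic mechanically, here is a small
Python sketch that rebuilds the full 2x2 table from the three starred
values count.pl prints (the joint count n11, the row total n1p, and the
column total np1) plus the sample size N from the first line of the
output file. This is just my own illustration, not code from NSP:

# n11 = joint count, n1p = row total, np1 = column total, N = sample size
def contingency_table(n11, n1p, np1, N):
    n12 = n1p - n11            # first event present, second absent
    n21 = np1 - n11            # first event absent, second present
    n22 = N - n1p - np1 + n11  # both absent
    return n11, n12, n21, n22

# incorrect_answer<>positive_feedback<>1 2 2, with N = 4
print(contingency_table(1, 2, 2, 4))   # -> (1, 1, 1, 1)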

> Question 2: given that we have about 2000 Student Moves (8 different
> types) and 2000 Tutor Moves (12 different types), should we be using x2 or
> leftFisher to compute significance?

Exact tests are technically speaking intended for rather small samples,
although we violate this pretty violently sometimes (using them for
million-word samples). But Fisher himself introduced the exact tests via
the "Tea Drinker's Problem", which had a sample size of 8. So, with your
sample of a couple thousand moves, I don't think an exact test would be
your first choice. Rather, I'd try x2 or ll (these are Pearson's
chi-squared test and the log-likelihood ratio).

Here's a trick you can try. Run statistic.pl with x2 and ll on your data,
and see if the scores are approximately the same. If they are, then the
data is well behaved, because both of these measures are asymptotically
approximated by the chi-squared distribution, and if the data were overly
skewed these scores would differ.
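
For the curious, here is a rough Python sketch of what those two scores
are for a 2x2 table, which shows why comparing them is a reasonable
sanity check. These are the standard textbook formulas, not statistic.pl's
own code:

import math

def x2_and_ll(n11, n12, n21, n22):
    N = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)        # row marginals
    cols = (n11 + n21, n12 + n22)        # column marginals
    observed = ((n11, n12), (n21, n22))
    x2 = ll = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / N
            o = observed[i][j]
            x2 += (o - expected) ** 2 / expected
            if o > 0:                    # 0 * log(0) is taken to be 0
                ll += 2 * o * math.log(o / expected)
    return x2, ll

# Well behaved data gives two scores that are close together:
print(x2_and_ll(10, 20, 30, 40))         # -> roughly (0.79, 0.80)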

Now, you will just get back the raw scores from x2 and ll, not significance
values, but you can assign significance to those by looking up the critical
values for the chi-squared distribution in a table in the back of a statistics
book. Here's a quick cheat sheet:

 alpha score
 0.1   2.706
 0.05  3.841
 0.01  6.635
 0.005 7.879

So the scores are what comes from the tests, and each score listed above is
the critical value at or above which the test shows significance at the
(1-alpha)*100% confidence level. alpha = .05 and .01 are commonly used, but
that sort of depends a little. Generally speaking, if you decide that an
alpha of .05 is appropriate for your data, you can enforce that by

statistic.pl ll --score 3.841 tutor1-ll.output tutor1.out

This will only show the "bigrams" that have a score greater than or equal to
3.841, which will be those you'll want to consider significant. I don't know if
it's a good idea to start with a strict score cutoff like this; usually it's
better to examine the data a little and see how things are looking.
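
(And if you would rather compute those critical values than look them up
in a book, something along these lines reproduces the cheat sheet,
assuming you have SciPy available. df = 1 because a bigram gives a 2x2
table:)

from scipy.stats import chi2

# Critical values of the chi-squared distribution with 1 degree of freedom
for alpha in (0.1, 0.05, 0.01, 0.005):
    print(alpha, round(chi2.ppf(1 - alpha, df=1), 3))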

So, if the data is well behaved then either x2 or ll will probably be just
fine. But, if you see a difference in those values, then you may want to
consider the use of an exact test (Fisher's left or right) as it doesn't
make those same asymptotic assumptions.

If your data looks ill behaved and you do start using Fisher's test, be
careful regarding the distribution of the data. Right now, in version 0.71 of
NSP, there is a problem if your n22 value is less than your n11 value.

That would be a situation like...

            positive  !positive
incorrect       1         1      |    2
!incorrect      1         3      |    4
               -------------------
                2         4           6

You can read recent email (from yesterday even!) about this situation, or
you can go back into the archives and find:

http://groups.yahoo.com/group/ngram/message/15
http://groups.yahoo.com/group/ngram/message/17

This problem is rising in priority even as we speak, so hopefully we'll
have a fix for it shortly. (A fix has in fact been suggested by a user,
and it is posted to the list, but we'll go back and double-check things
just to be sure.)

OK, sorry for all this. I'm sure some of this is already well known to
you, but it seemed like a good opportunity to review some of these issues
for the benefit of other users (and myself!) too.

I hope this helps. Please let us know how this works out, or if you have
any other questions (at all). This sounds like a very clever use of NSP,
and I'm very glad to see you doing this!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse




