Hi Kate,
> We are trying to use nsp to find dependent bigrams in a corpus of
> annotated tutoring dialogues, and we have a few (probably pretty basic)
> questions:

Wow! This is really really neat. I'm so amazed that someone thought to
do this. Really. I am so used to counting up collocations that I have a
hard time thinking of using nsp for anything else. But I'm glad other
people have more vision than I do. :)

> The tutor turns in our dialogues are annotated with "Tutor
> Moves", and the student turns are annotated with "Student Moves".
> At an abstract level, the dialogues look like this:
>
> Tutor Move 0
> Student Move 1
> Tutor Move 2
> Student Move 3
> ...
>
> We've used nsp to extract all the "Move" bigrams in this data, e.g.:
>
> Tutor Move 0<>Student Move 1
> Student Move 1<>Tutor Move 2
> ...

This is very clever.

> However, we are *only* investigating whether any of the
> Student Move<>Tutor Move bigrams are dependent. In other words, we're
> asking the question:
>
> Of all the possible Student Move<>Tutor Move bigrams, are there
> any that occur more frequently than predicted by chance?

Got it! Put another way, is there evidence of an association of some
sort between the student move and the tutor move?

BTW, note that the NSP tests themselves do not assume any
"directionality". In other words, it is not possible to say that the
student move causes the tutor move, or anything like that, at least not
on the basis of the NSP measures. The way you have written the above
suggests to me that you understand this already, but I just thought I'd
throw that in.

> For example, suppose our Student Moves can be: "correct answer" or
> "incorrect answer" and our Tutor Moves can be: "positive feedback" or
> "negative feedback". Then the bigrams we are interested in are:
>
> correct answer<>positive feedback
> correct answer<>negative feedback
> incorrect answer<>positive feedback
> incorrect answer<>negative feedback

OK, this makes sense.
> However, count.pl gives us these bigrams and those shown below, where
> the Tutor Move is the first element in the bigram:
>
> positive feedback<>correct answer
> negative feedback<>correct answer
> positive feedback<>incorrect answer
> negative feedback<>incorrect answer

Right, NSP will just chop your data up into two-"word" sequences, so if
you have

 w1 w2 w3 w4

you will get the bigrams

 w1<>w2
 w2<>w3
 w3<>w4

> Question 1: is it ok that count.pl totals up all the bigrams, or will
> this skew the statistical results? In other words, should we only be
> totaling up bigrams where a Student Move is the first element?

Interesting question. The sample size is affected by the number of
bigrams you count, and indeed that has a big effect on the results of
the tests. If you are saying that positive feedback<>correct answer is
an "invalid" combination, then I think you probably want to throw it
and its like out, because these will be added to your sample size. (By
invalid combination I mean that this combination of events in this
order will never and can never occur.)

In the context of bigrams, we are always thinking of representing two
random variables, WORD1 and WORD2, where WORD1 is presumed to be the
first word in the bigram and WORD2 is the second word. Now, given the
nature of text, a word that is the second word in one bigram is also
going to be the first word of the next bigram, unless it is at the end
of a file OR at the end of a line. And I think this is where count.pl
can come to your aid.

In your case, each of your bigrams should be thought of as being its
own sentence. In other words, let us suppose your input file to
count.pl looks like this (let's call it tutor.dat):

correct_answer positive_feedback
correct_answer negative_feedback
incorrect_answer positive_feedback
incorrect_answer negative_feedback

Now, if you do

count.pl tutor.out tutor.dat

this is tutor.out...
7
negative_feedback<>incorrect_answer<>1 1 2
positive_feedback<>incorrect_answer<>1 2 2
incorrect_answer<>positive_feedback<>1 2 2
positive_feedback<>correct_answer<>1 2 1
correct_answer<>negative_feedback<>1 2 2
correct_answer<>positive_feedback<>1 2 2
incorrect_answer<>negative_feedback<>1 2 2

...which doesn't make sense, because you want your first column to
represent the random variable STUDENT-MOVE and the second column to
represent the random variable TUTOR-MOVE. So you have ill-defined
random variables, your sample size is too large, it's all just chaos
really :) Enter the --newLine option.

count.pl --newLine tutor1.out tutor.dat

This is tutor1.out...

4
incorrect_answer<>positive_feedback<>1 2 2
correct_answer<>negative_feedback<>1 2 2
correct_answer<>positive_feedback<>1 2 2
incorrect_answer<>negative_feedback<>1 2 2

This looks good. You have two well-defined random variables represented
in each column (STUDENT-MOVE and TUTOR-MOVE), and your sample size is
correct. So, you are right to be suspicious of counting up everything
together. Fortunately --newLine makes it easy to avoid that.

Let me just make sure that I'm understanding things right with a
re-representation of the following:

4
incorrect_answer<>positive_feedback<>1 2 2

this means...

              positive  !positive
 incorrect       1*         1    |  2*
 !incorrect      1          1    |  2
              -------------------
                 2*         2       4*

In other words, reading across the first row, there is one combination
where the incorrect answer was followed by positive feedback (1*), one
combination where the incorrect answer was followed by non-positive
feedback, and 2 combinations where incorrect answer was the student
response (2*). Hopefully that accurately represents your situation.
[The starred values come from the count.pl output, and the rest can be
worked out from those.]

> Question 2: given that we have about 2000 Student Moves (8 different
> types) and 2000 Tutor Moves (12 different types), should we be using
> x2 or leftFisher to compute significance?
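Just to make that arithmetic concrete, here is a little Python sketch
(my own illustration, not part of NSP) of how the full 2x2 table can be
worked out from the three counts on one line of count.pl output plus
the sample size on the first line:

```python
def contingency_table(n11, n1p, np1, npp):
    """Recover the 2x2 contingency table from count.pl's counts.

    n11 = joint frequency of the bigram
    n1p = frequency of the first token in the first position
    np1 = frequency of the second token in the second position
    npp = total sample size (the first line of count.pl output)
    """
    n12 = n1p - n11                 # first token, then something else
    n21 = np1 - n11                 # something else, then second token
    n22 = npp - n11 - n12 - n21     # neither token in its position
    return n11, n12, n21, n22

# incorrect_answer<>positive_feedback<>1 2 2, with sample size 4
print(contingency_table(1, 2, 2, 4))   # (1, 1, 1, 1)
```

These four cells are exactly the starred-plus-derived values in the
table above.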
Exact tests are, technically speaking, intended for rather small
samples, although we violate this pretty violently sometimes (using
them for million-word samples). But Fisher himself introduced the exact
tests via the "Tea Drinker's Problem", which had a sample size of 8.
So, I don't think an exact test would be your first choice. Rather, I'd
try x2 or ll (these are Pearson's test and the log-likelihood ratio).

Here's a trick you can try. Run statistic.pl with x2 and ll for your
data, and see if the scores are approximately the same. If they are,
then the data is well behaved, because both of these measures are
asymptotically approximated by the chi-squared distribution, and if the
data were overly skewed these scores would differ.

Now, you will just get back the raw scores from x2 and ll, not
significance values, but you can assign significance to those by
looking up the critical values for the chi-squared distribution in a
table in the back of a statistics book. Here's a quick cheat sheet:

alpha    score
0.1      2.706
0.05     3.841
0.01     6.635
0.005    7.879

The scores are what come from the tests, and each alpha gives the
critical value the score must reach for the test to show significance
with (1 - alpha) certainty. alpha = .05 and .01 are commonly used, but
that sort of depends a little. Generally speaking, if you decide that
an alpha of .05 is appropriate for your data, you can enforce that by

statistic.pl ll --score 3.841 tutor1-ll.output tutor1.out

This will only show the "bigrams" that have a score greater than or
equal to 3.841, which will be those you'll want to consider
significant. I don't know if it's a good idea to start with a strict
score cutoff like this; usually it's better to examine the data a
little and see how things are looking.

So, if the data is well behaved, then either x2 or ll will probably be
just fine.
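In case it helps to see what those two scores are measuring, here is a
rough Python sketch (mine, not NSP's actual code) of Pearson's
chi-squared and the log-likelihood ratio for a 2x2 table, with made-up
counts:

```python
import math

def scores(n11, n12, n21, n22):
    """Pearson's x2 and the log-likelihood ratio for a 2x2 table."""
    npp = n11 + n12 + n21 + n22
    observed = [n11, n12, n21, n22]
    # expected counts under independence: row total * column total / n
    expected = [(n11 + n12) * (n11 + n21) / npp,
                (n11 + n12) * (n12 + n22) / npp,
                (n21 + n22) * (n11 + n21) / npp,
                (n21 + n22) * (n12 + n22) / npp]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    ll = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return x2, ll

x2, ll = scores(10, 20, 30, 40)    # made-up example counts
# x2 is about 0.79 and ll about 0.80 here: close to each other (well
# behaved data), and both below the 3.841 cutoff for alpha = 0.05.
print(x2, ll, x2 >= 3.841)
```

If the two numbers disagree badly for your real counts, that's the sign
of skewed data mentioned above.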
But if you see a difference in those values, then you may want to
consider the use of an exact test (Fisher's left or right), as it
doesn't make those same asymptotic assumptions.

If your data looks ill behaved and you do start using Fisher's test, be
careful regarding the distribution of the data. Right now, in version
0.71 of NSP, there is a problem if your n22 value is less than your n11
value. That would be a situation like...

              positive  !positive
 incorrect       1          1    |  2
 !incorrect      1          3    |  4
              -------------------
                 2          4       6

You can read recent email (from yesterday even!) about this situation,
or you can go back into the archives and find:

http://groups.yahoo.com/group/ngram/message/15
http://groups.yahoo.com/group/ngram/message/17

This problem is rising in priority even as we speak, so hopefully we'll
have a fix for it shortly. (A fix has in fact been suggested by a user,
and that is posted to the list, but we'll go back and double check
things just to be sure.)

OK, sorry for all this. I'm sure some of this is already well known to
you, but it seemed like a good opportunity to review some of these
issues for the benefit of other users (and myself!) too.

I hope this helps. Please let us know how this works out, or if you
have any other questions (at all). This sounds like a very clever use
of NSP, and I'm very glad to see you doing this!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse