When you are counting longer ngrams (--ngram > 2) with the Ngram Statistics Package (NSP), you might not want all the counts that NSP provides. You may not actually need them, and it saves time if you don't collect them. This is what the --set_freq_combo option to count.pl is meant to support.

For example, when you run...

    count.pl --ngram 3 output wrnpc12.txt

You will get a file named output that starts off like this...

    654831
    .<>.<>.<>2138 30682 30682 30682 3929 2340 3929
    ,<>and<>the<>568 39925 21186 31863 6643 2402 1390
    .<>It<>was<>285 30682 899 7342 747 1046 308
    .<>Well<>,<>279 30682 387 39925 322 2119 335
    .<>Yes<>,<>274 30682 413 39925 300 2119 371
    .<>Prince<>Andrew<>240 30682 1579 1142 336 243 1071     (USED AS EXAMPLE BELOW)
    ,<>and<>he<>240 39925 21186 8134 6643 913 389

The first line (654831) is the total number of 3grams counted. The first number shown after each 3gram is the number of times the 3gram has occurred (we refer to this as f(0,1,2), meaning the frequency of the 0th, 1st and 2nd word in the ngram together). Then you get f(0), f(1), f(2), f(0,1), f(0,2), f(1,2)... you can see a more complete explanation of this here:

    http://search.cpan.org/dist/Text-NSP/doc/README.pod#5.5._The_Output_Format_of_count.pl

So, applying this to the ngram ". Prince Andrew", we can see from the output above that this trigram occurs 240 times. Then we get the unigram counts: "." occurs 30,682 times, "Prince" occurs 1,579 times, and "Andrew" occurs 1,142 times.

Finally, the "bigram" counts are shown - although these have a slightly different interpretation when referring to words that aren't adjacent. f(0,1) and f(1,2) are adjacent, while f(0,2) is not. The adjacent bigram f(0,1) ". Prince" occurs 336 times, the non-adjacent bigram f(0,2) ". Andrew" occurs 243 times, and the adjacent bigram f(1,2) "Prince Andrew" occurs 1,071 times.

Note that if you run count.pl with --ngram 2 on this same data, you will find that the bigram counts for ". Prince" and "Prince Andrew" agree with those reported here. However, the count for ". Andrew" f(0,2) is not the same as an adjacent bigram count. What these "non-adjacent" bigrams really represent is the count of ". * Andrew" (where * is a wildcard) - that is, how often "." and "Andrew" occur together with exactly one word in between them.

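To make the meaning of f(0,2) and the other counts concrete, here is a minimal Python sketch of how such counts could be tallied over a list of tokens. It is only a simplified model and not how count.pl is actually implemented: the function name count_combos and the toy token list are made up for illustration, and it ignores tokenization, line boundaries, and every other count.pl option.

```python
from collections import Counter
from itertools import combinations


def count_combos(tokens, n=3, combos=None):
    """Tally f(...) counts for each position combination over all n-grams.

    combos is a list of tuples of positions, e.g. (0, 2) for f(0,2).
    If omitted, every non-empty subset of positions is counted
    (an assumption that matches the trigram output shown above).
    """
    if combos is None:
        combos = [c for k in range(1, n + 1)
                  for c in combinations(range(n), k)]
    counts = {c: Counter() for c in combos}
    # Slide a window of n tokens over the text, one n-gram at a time.
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i:i + n]
        for c in combos:
            # Keep only the words at the positions named by this combination.
            counts[c][tuple(ngram[p] for p in c)] += 1
    return counts


# Hypothetical toy text, not taken from wrnpc12.txt.
tokens = [".", "Prince", "Andrew", ".", "Old", "Andrew", ".", "Prince", "Andrew"]
counts = count_combos(tokens)
print("f(0,1,2):", counts[(0, 1, 2)][(".", "Prince", "Andrew")])
print("f(0,1):  ", counts[(0, 1)][(".", "Prince")])
print("f(0,2):  ", counts[(0, 2)][(".", "Andrew")])
```

In this toy text, f(0,2) for ". Andrew" also picks up ". Old Andrew", which is the ". * Andrew" behaviour described above. Passing a shorter combos list, such as [(0, 1, 2)], does less work per trigram, which is the idea behind --set_freq_combo.
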
The f(0,2) count is *almost* like using --ngram 2 --window 3 with count.pl, except that with the window option you get the count of ". Andrew" and ". * Andrew" combined, whereas f(0,2) is just the latter.

Anyway....that's what you get by default. It might be though that you don't want all these counts, and that you simply care about the ngram counts themselves. If that is the case, then --set_freq_combo is for you. :)

If you just want the trigram counts, you can specify a file called mycomb0.txt that contains the following:

    0 1 2

If you then run count.pl as follows...

    count.pl --ngram 3 --set_freq_combo mycomb0.txt output0 wrnpc12.txt

Then you get a file called output0 that looks like this...

    654831
    .<>.<>.<>2138
    ,<>and<>the<>568
    .<>It<>was<>285
    .<>Well<>,<>279
    .<>Yes<>,<>274
    .<>Prince<>Andrew<>240
    ,<>and<>he<>240

This might be all that you want, and the good news is it is much faster to obtain. :)

What if you want both the trigram and unigram counts? No problem....specify a file called mycomb1.txt that contains the following, one combination per line:

    0 1 2
    0
    1
    2

Then you can run as follows:

    count.pl --ngram 3 --set_freq_combo mycomb1.txt output1 wrnpc12.txt

And you get an output file that looks like this...

    654831
    .<>.<>.<>2138 30682 30682 30682
    ,<>and<>the<>568 39925 21186 31863
    .<>It<>was<>285 30682 899 7342
    .<>Well<>,<>279 30682 387 39925
    .<>Yes<>,<>274 30682 413 39925
    .<>Prince<>Andrew<>240 30682 1579 1142
    ,<>and<>he<>240 39925 21186 8134

Now, finally, here's some actual "proof" - when using --set_freq_combo to find just the trigram counts, the time taken is about 25 seconds of user CPU time. When you get the trigram and unigram counts, it is about 35 seconds. If you get the unigram, bigram, and trigram counts (the default), it is about 48 seconds. As your ngrams get longer or your data gets larger, these savings become more and more dramatic.

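One way to see why longer ngrams make the savings grow is to count how many frequency combinations there are to collect. The sketch below assumes that the default set is every non-empty subset of the n positions, listed in the same order as the trigram output above (the full ngram first, then unigrams, then pairs, and so on). The helper names default_combos and combo_file_text are made up for illustration and are not part of NSP.

```python
from itertools import combinations


def default_combos(n):
    """Every non-empty subset of positions 0..n-1, full n-gram first.

    Assumption: this mirrors the default counts shown for trigrams,
    f(0,1,2), f(0), f(1), f(2), f(0,1), f(0,2), f(1,2).
    """
    full = tuple(range(n))
    smaller = [c for k in range(1, n) for c in combinations(range(n), k)]
    return [full] + smaller


def combo_file_text(combos):
    """Format combinations one per line, positions separated by spaces."""
    return "\n".join(" ".join(str(p) for p in c) for c in combos) + "\n"


# The number of counts per n-gram grows as 2**n - 1.
for n in (2, 3, 4, 5):
    print(n, len(default_combos(n)))

# Trigram plus unigram counts only, in the layout of mycomb1.txt.
print(combo_file_text(default_combos(3)[:4]), end="")
```

Under that assumption a trigram carries 7 counts by default, a 4gram 15, and a 5gram 31, so trimming the list to the combinations you actually need removes a larger share of the work as n grows.
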
The raw timings:

    % time count.pl --ngram 3 --set_freq_combo mycomb0.txt output0 wrnpc12.txt
    24.725u 0.248s 0:35.57 70.1% 0+0k 0+19144io 0pf+0w

    % time count.pl --ngram 3 --set_freq_combo mycomb1.txt output1 wrnpc12.txt
    34.518u 0.284s 0:52.53 66.2% 0+0k 0+30504io 0pf+0w

    % time count.pl --ngram 3 output wrnpc12.txt
    48.435u 0.476s 1:13.20 66.8% 0+0k 0+37096io 0pf+0w

BTW, all of the above is on War and Peace, which is a long novel but a fairly small corpus, so your savings will be more dramatic on larger corpora....

    % wc wrnpc12.txt
    67403 564514 3285165 wrnpc12.txt

In any case, if you are using longer ngrams, I think it's very likely you will find --set_freq_combo very handy. Please don't hesitate to let us know if you have any questions on how to use or interpret this.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse