[ngram] counting longer n-grams with --set_freq_comb

Ted Pedersen Wed, 26 Nov 2008 08:27:13 -0800

When you are counting longer ngrams (--ngram > 2) with the Ngram
Statistics Package, you might not want to get all the counts that NSP
provides, since you may not actually need them, and it will save some
time if you don't collect them. This is what the --set_freq_combo
option to count.pl is meant to support.


For example, when you run...

count.pl --ngram 3 output wrnpc12.txt

You will get a file named output that starts off like this...

654831
.<>.<>.<>2138 30682 30682 30682 3929 2340 3929
,<>and<>the<>568 39925 21186 31863 6643 2402 1390
.<>It<>was<>285 30682 899 7342 747 1046 308
.<>Well<>,<>279 30682 387 39925 322 2119 335
.<>Yes<>,<>274 30682 413 39925 300 2119 371
.<>Prince<>Andrew<>240 30682 1579 1142 336 243 1071       (USED AS EXAMPLE BELOW)
,<>and<>he<>240 39925 21186 8134 6643 913 389

The first number shown after each 3gram is the number of times the
3gram has occurred (we refer to this as f(0,1,2), meaning the
frequency of the 0th, 1st and 2nd words in the ngram together).
Then you get f(0), f(1), f(2), f(0,1), f(0,2), and f(1,2). You can see
a more complete explanation of this here:

http://search.cpan.org/dist/Text-NSP/doc/README.pod#5.5._The_Output_Format_of_count.pl
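
To make the column order concrete, here is a minimal Python sketch (not
part of NSP) that labels each number on a default --ngram 3 output line
with its frequency combination. The names parse_count_line and
DEFAULT_TRIGRAM_COMBOS are made up for this illustration; the column
order is simply the one listed above.

```python
# Column order for default --ngram 3 output, as listed in the text above.
# This list is an assumption of the sketch, not something read from NSP.
DEFAULT_TRIGRAM_COMBOS = [(0, 1, 2), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]


def parse_count_line(line, combos=DEFAULT_TRIGRAM_COMBOS):
    """Split one count.pl output line into its tokens and labelled counts."""
    # Tokens are separated by "<>"; the final field holds the counts.
    *tokens, counts_field = line.rstrip("\n").split("<>")
    counts = [int(c) for c in counts_field.split()]
    if len(counts) != len(combos):
        raise ValueError(f"expected {len(combos)} counts, got {len(counts)}")
    return tokens, dict(zip(combos, counts))


tokens, freq = parse_count_line(
    ".<>Prince<>Andrew<>240 30682 1579 1142 336 243 1071"
)
print(tokens)          # the three tokens of the trigram
print(freq[(0, 1, 2)]) # trigram count
print(freq[(0, 2)])    # count for the 0th and 2nd words
```

Passing a shorter combos list lets the same function read output that
contains fewer columns.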

So, applying this to the ngram ". Prince Andrew", we can see from the
output above that this trigram occurs 240 times. Then we get the
unigram counts :  "." occurs 30,682 times, "Prince" occurs 1,579
times, and "Andrew" occurs 1,142 times.

Finally, the "bigram" counts are shown - although these have a
slightly different interpretation when referring to words that aren't
adjacent...f(0,1) and f(1,2) are adjacent, while f(0,2) is not...

The adjacent bigram f(0,1) ". Prince" occurs 336 times, the
non-adjacent bigram f(0,2) ". Andrew" occurs 243 times, and the
adjacent bigram f(1,2) "Prince Andrew" occurs 1,071 times. Note that
if you run count.pl with --ngram 2 on this same data you will find
that the bigram counts for ". Prince" and "Prince Andrew" agree with
those reported here. However, the count for ". Andrew" f(0,2) is not
the same as an adjacent bigram count. What these "non-adjacent"
bigrams really represent is the count of ". * Andrew" (where * is a
wildcard), that is, how often "." and "Andrew" occur with exactly one
word in between them. This is *almost* like using --ngram 2 --window 3
with count.pl, except that with the window option you get the combined
count of ". Andrew" and ". * Andrew", whereas f(0,2) is just the
latter.
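
The difference between f(0,2) and a windowed bigram count can be seen
on a toy example. The following Python sketch is only an illustration
of the idea, not how count.pl is implemented; it uses a made-up token
list and ignores count.pl's tokenization rules and any special handling
at the edges of the corpus.

```python
from collections import Counter


def trigram_f02(tokens):
    """Count (0th word, 2nd word) pairs over every consecutive trigram."""
    pairs = Counter()
    for i in range(len(tokens) - 2):
        pairs[(tokens[i], tokens[i + 2])] += 1
    return pairs


def adjacent_bigrams(tokens):
    """Count ordinary adjacent bigrams."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical toy corpus, already split into tokens.
tokens = ". Prince Andrew smiled . Andrew left . Old Andrew".split()

pair = (".", "Andrew")
f02 = trigram_f02(tokens)[pair]            # ". * Andrew" only
adjacent = adjacent_bigrams(tokens)[pair]  # ". Andrew" only
windowed = f02 + adjacent                  # what a window of 3 would combine

print(f02, adjacent, windowed)
```

Here ". * Andrew" occurs twice and ". Andrew" occurs once, so the
windowed count is larger than f(0,2) by exactly the adjacent count.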

Anyway....that's what you get by default. It might be though that you
don't want all these counts, and that you simply care about the ngram
counts themselves. If that is the case, then --set_freq_combo is for
you. :)

If you just want the trigram counts, you can specify a file called
mycomb0.txt that contains the following:

0 1 2

If you then run count.pl as follows...

count.pl --ngram 3 --set_freq_combo mycomb0.txt output0 wrnpc12.txt

Then you get a file called output0 that looks like this...

654831
.<>.<>.<>2138
,<>and<>the<>568
.<>It<>was<>285
.<>Well<>,<>279
.<>Yes<>,<>274
.<>Prince<>Andrew<>240
,<>and<>he<>240

This might be all that you want, and the good news is it is much
faster to obtain. :)

What if you want both the trigram and unigram counts? No
problem....specify a file called mycomb1.txt that contains the
following:

0 1 2
0
1
2

Then you can run as follows:

count.pl --ngram 3 --set_freq_combo mycomb1.txt output1 wrnpc12.txt

And you get an output file that looks like this...

654831
.<>.<>.<>2138 30682 30682 30682
,<>and<>the<>568 39925 21186 31863
.<>It<>was<>285 30682 899 7342
.<>Well<>,<>279 30682 387 39925
.<>Yes<>,<>274 30682 413 39925
.<>Prince<>Andrew<>240 30682 1579 1142
,<>and<>he<>240 39925 21186 8134
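
For longer ngrams, typing these combination files by hand gets tedious.
As a rough sketch, the lines can be generated with a few lines of
Python. The helper name combo_lines is hypothetical, and the only thing
assumed about the file format is what the examples above show: one
combination per line, with word positions separated by spaces. I have
not checked whether the order of the lines matters to count.pl, so the
sketch just reproduces the order used in mycomb1.txt.

```python
from itertools import combinations


def combo_lines(n, sizes):
    """Return combination-file lines for an n-gram.

    Only index sets whose size appears in `sizes` are produced,
    in the order the sizes are given.
    """
    lines = []
    for size in sizes:
        for combo in combinations(range(n), size):
            lines.append(" ".join(str(i) for i in combo))
    return lines


# Full trigram plus the three unigrams, as in mycomb1.txt.
lines = combo_lines(3, sizes=[3, 1])
print("\n".join(lines))

# To save them to a file:
# with open("mycomb1.txt", "w") as fh:
#     fh.write("\n".join(lines) + "\n")
```

The same call with combo_lines(4, sizes=[4, 1]) would give the 4-gram
and its four unigrams.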

Now, finally, here's some actual "proof". When using --set_freq_combo
to find just the trigram counts, the user CPU time is about 25
seconds. When you get the trigram and unigram counts, it is about 35
seconds. If you get the unigram, bigram, and trigram counts (the
default), it is about 48 seconds. The elapsed times reported below
(about 36, 53, and 73 seconds) follow the same pattern. As your ngrams
get longer or your data gets larger, these savings become more and
more dramatic.

%time count.pl --ngram 3 --set_freq_combo mycomb0.txt output0 wrnpc12.txt
24.725u 0.248s 0:35.57 70.1%    0+0k 0+19144io 0pf+0w

%time count.pl --ngram 3 --set_freq_combo mycomb1.txt output1 wrnpc12.txt
34.518u 0.284s 0:52.53 66.2%    0+0k 0+30504io 0pf+0w

%time count.pl --ngram 3 output wrnpc12.txt
48.435u 0.476s 1:13.20 66.8%    0+0k 0+37096io 0pf+0w
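
To put the timings above in relative terms, here is a small Python
sketch that reads the three time lines and works out how much user CPU
time each reduced run saves compared with the default run. The helper
parse_csh_time is made up for this example and assumes the elapsed
field has the minutes:seconds form shown above.

```python
def parse_csh_time(line):
    """Extract user, system and elapsed seconds from a csh-style time line."""
    user, system, elapsed = line.split()[:3]
    minutes, seconds = elapsed.split(":")  # assumes m:ss.ss, as in the lines above
    return {
        "user": float(user.rstrip("u")),
        "system": float(system.rstrip("s")),
        "elapsed": int(minutes) * 60 + float(seconds),
    }


runs = {
    "trigram only": "24.725u 0.248s 0:35.57 70.1%",
    "trigram + unigram": "34.518u 0.284s 0:52.53 66.2%",
    "default (all counts)": "48.435u 0.476s 1:13.20 66.8%",
}

times = {name: parse_csh_time(line) for name, line in runs.items()}
baseline = times["default (all counts)"]["user"]

savings = {}
for name, t in times.items():
    savings[name] = 1 - t["user"] / baseline
    print(f"{name}: {t['user']:.1f}s user, {savings[name]:.0%} less than default")
```

On these numbers, counting only the trigrams takes roughly half the
user CPU time of the default run, and adding the unigrams takes a bit
under three quarters of it.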

BTW, all of the above is on War and Peace, which is a long novel but a
fairly small corpus, so your savings will be more dramatic on larger
corpora....

%wc wrnpc12.txt
67403  564514 3285165 wrnpc12.txt

In any case, if you are using longer ngrams, I think it's very likely
you will find --set_freq_combo very handy. Please don't hesitate to
let us know if you have any questions on how to use or interpret this.

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
