Re: [ngram] search in file generated by statistic.pl

Ted Pedersen Wed, 25 Mar 2009 07:12:49 -0700

On Wed, Mar 25, 2009 at 6:50 AM, arezki20002002 <arezki20002...@yahoo.fr> wrote:
> Hello,
> once the file generated by statistic.pl
> how can I know a bigram appears in this file?
> thank you
> Arezki


Hi Arezki,

I tend to use the grep command to search through my statistic.pl
output when I'm looking for a specific ngram.
For example, I processed the biography "The Fabulous Life of Diego
Rivera" as follows:

count.pl fab.out fabulous-life-of-diego-rivera.txt

statistic.pl ll.pm fab-ll.out fab.out

Then I decided I wanted to find out if "Tina Modotti" occurred in that book...

marimba(22): grep "Tina<>Modotti" fab-ll.out
Tina<>Modotti<>146 231.4471 13 26 14

This tells me that she did: the bigram occurred 13 times and was the
146th ranked bigram according to log-likelihood (score 231.4471). Tina
occurred 26 times as the first word of a bigram, and Modotti occurred
14 times as the second word of a bigram.
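
To make that reading of the line concrete, here is a minimal sketch that labels each field with awk. It assumes every bigram line has the same layout as the one above (two words separated by "<>", then rank, score and three counts). The helper name describe_bigram and the labels joint/first/second are made up for illustration and are not part of the package.

```shell
# Hypothetical helper: label the fields of one bigram line from statistic.pl output.
# Assumed layout: word1<>word2<>rank score joint-count first-count second-count
describe_bigram() {
    printf '%s\n' "$1" | awk -F'<>' '{
        # $1 and $2 are the two words; $3 holds the five numbers, space separated.
        split($3, f, " ")
        printf "bigram=%s %s rank=%s score=%s joint=%s first=%s second=%s\n",
               $1, $2, f[1], f[2], f[3], f[4], f[5]
    }'
}

# Line copied from the grep output above.
describe_bigram 'Tina<>Modotti<>146 231.4471 13 26 14'
# prints: bigram=Tina Modotti rank=146 score=231.4471 joint=13 first=26 second=14
```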

I also searched for just Modotti:

marimba(23): grep "Modotti" fab-ll.out
Tina<>Modotti<>146 231.4471 13 26 14
Modotti<>.<>1624 39.2108 9 14 7804
Modotti<>rejected<>6575 11.4513 1 14 17
Modotti<>served<>6641 11.3337 1 14 18
than<>Modotti<>11839 5.9592 1 262 14
Modotti<>was<>16621 2.5137 1 14 1611
Modotti<>and<>19152 0.8857 1 14 4451
Modotti<>,<>21349 0.0072 1 14 14352

Among other things, here I can see that Modotti is the second word of
two different bigrams: "Tina Modotti" (13 times, as we saw above) and
"than Modotti" (1 time). That confirms the total of 14 bigrams in
which Modotti is the second word.
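
The same check can be done mechanically. The sketch below copies the eight lines from the grep output above into a scratch file and adds up the joint counts of the bigrams whose second word is Modotti. The helper name sum_as_second is hypothetical, and the sketch assumes the third number on each line is the count of the bigram itself.

```shell
# Scratch copy of the eight lines printed by the second grep above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Tina<>Modotti<>146 231.4471 13 26 14
Modotti<>.<>1624 39.2108 9 14 7804
Modotti<>rejected<>6575 11.4513 1 14 17
Modotti<>served<>6641 11.3337 1 14 18
than<>Modotti<>11839 5.9592 1 262 14
Modotti<>was<>16621 2.5137 1 14 1611
Modotti<>and<>19152 0.8857 1 14 4451
Modotti<>,<>21349 0.0072 1 14 14352
EOF

# Hypothetical helper: add up the joint counts of all bigrams in file $2
# whose second word is exactly $1.
sum_as_second() {
    awk -F'<>' -v w="$1" '
        $2 == w { split($3, f, " "); sum += f[3] }
        END     { print sum + 0 }
    ' "$2"
}

sum_as_second Modotti "$sample"   # prints 14 (13 + 1)
```

The result agrees with the last number on the "Tina<>Modotti" line, which is the count of bigrams with Modotti as the second word.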

Fishing around like this can be quite fun. You could also use egrep
(or grep -E) to search for extended regular expression patterns rather
than just plain strings, but I find grep to be a nice starting point.
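
As an illustration of the pattern approach, here is a small sketch using grep -E, which is equivalent to egrep. It runs against a scratch copy of the eight lines shown earlier. The variable names are made up for the example, and the patterns assume that "<>" appears only as the separator between words.

```shell
# Scratch copy of the eight lines printed by the Modotti search above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Tina<>Modotti<>146 231.4471 13 26 14
Modotti<>.<>1624 39.2108 9 14 7804
Modotti<>rejected<>6575 11.4513 1 14 17
Modotti<>served<>6641 11.3337 1 14 18
than<>Modotti<>11839 5.9592 1 262 14
Modotti<>was<>16621 2.5137 1 14 1611
Modotti<>and<>19152 0.8857 1 14 4451
Modotti<>,<>21349 0.0072 1 14 14352
EOF

# ^ anchors the match at the start of the line, so this counts only
# the bigrams whose FIRST word is Modotti.
as_first=$(grep -c -E '^Modotti<>' "$sample")

# A "<>" in front of Modotti means another word precedes it, so these
# are the bigrams whose SECOND word is Modotti.
as_second=$(grep -c -E '<>Modotti<>' "$sample")

# Alternation with ( | ) is an extended-regex feature: match either verb.
verbs=$(grep -c -E '^Modotti<>(rejected|served)<>' "$sample")

echo "first: $as_first, second: $as_second, verbs: $verbs"
# prints: first: 6, second: 2, verbs: 2
```

Dropping the -c option prints the matching lines themselves instead of counting them.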

I hope this is helpful!

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
