Greetings all,

On Mon, Feb 23, 2009 at 8:11 AM, robsteranium <robsteran...@yahoo.co.uk> wrote:
> I would run count.pl without a stop list (or perhaps just a normal
> stop list) and then process the output in another program (e.g. sed).
> This one liner would do the trick:
>
> sed -ne '/XTOPOX/p' count-output.cnt
>
I think this is a great suggestion. One could also use grep or egrep from the command line, but sed is quite powerful and useful to know about.

I don't know if I recommend the following, but it does make some useful points about using regular expressions in NSP, so I thought I would pass it along.

This is my input:

marimba(51): cat testin
I went to XTOPOX and then left XTOPOX.
I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
Where is XTOPOX
XTOPOX is in Namibia

I would like to find all the bigrams that include XTOPOX. Another option would be to define the tokens such that they are two words long and must include XTOPOX as either the first or second word. Note that we will need to treat tokens of this form as unigrams in order to get the desired result.

I define my tokens as follows:

marimba(52): cat token.txt
/\b\w+\b\s+\bXTOPOX\b/
/\bXTOPOX\b\s+\b\w+\b/

And then run count.pl using that as the tokenization file:

marimba(54): count.pl --ngram 1 testout testin --token token.txt

My output is as follows:

marimba(55): cat testout
6
like XTOPOX<>2
XTOPOX is<>1
is XTOPOX<>1
left XTOPOX<>1
to XTOPOX<>1

Note that we are getting counts here of how many times each two word string has occurred, and that these two word strings are indivisible (these are the tokens, the atomic units of our counting).

If you were to run this without using --ngram 1 you would get --ngram 2 by default, and the results would look rather odd:

marimba(111): count.pl --ngram 2 test2out testin --token token.txt
marimba(112): more test2out
5
is XTOPOX<>XTOPOX is<>1 1 1
like XTOPOX<>like XTOPOX<>1 2 2
left XTOPOX<>like XTOPOX<>1 1 2
to XTOPOX<>left XTOPOX<>1 1 1
like XTOPOX<>is XTOPOX<>1 2 1

However, this makes sense: count.pl is giving bigram counts over these two word unigrams, with the two tokens in each bigram separated by the <>.
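For anyone who wants to see the idea without NSP at hand, here is a rough Python sketch of what the two token definitions do. It is only an approximation of count.pl's tokenization, not NSP itself: it assumes the regexes in token.txt are tried in order at each position, scanning left to right, and that text matching neither regex is skipped. The names TEXT, tokenize and ngram_counts are made up for this illustration, and the text is the sample from testin with line breaks replaced by spaces.

```python
import re
from collections import Counter

# Sample text from testin above; line breaks replaced by spaces for this sketch.
TEXT = (
    "I went to XTOPOX and then left XTOPOX. "
    "I like XTOPOX pretty well, but I don't think you like XTOPOX at all. "
    "Where is XTOPOX "
    "XTOPOX is in Namibia"
)

# The two token definitions from token.txt, without the surrounding slashes.
TOKEN_PATTERNS = [r"\b\w+\b\s+\bXTOPOX\b", r"\bXTOPOX\b\s+\b\w+\b"]


def tokenize(text, patterns=TOKEN_PATTERNS):
    """Scan left to right, taking the first pattern that matches at each position."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    # Collapse whitespace inside each token (a simplification made for this sketch).
    return [" ".join(m.group(0).split()) for m in combined.finditer(text)]


def ngram_counts(tokens, n):
    """Count n-grams over the token sequence, joined with the <> separator."""
    grams = ("<>".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


tokens = tokenize(TEXT)
unigrams = ngram_counts(tokens, 1)  # corresponds to --ngram 1
bigrams = ngram_counts(tokens, 2)   # corresponds to --ngram 2

# Print in the same "total, then token<>count" layout used above.
print(sum(unigrams.values()))
for gram, count in unigrams.most_common():
    print(f"{gram}<>{count}")
```

On the sample text this finds six two word tokens, with "like XTOPOX" counted twice, and five bigrams over those tokens, which agrees with the totals count.pl reported. The marginal counts that count.pl prints after each bigram are not reproduced here.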
Again, I don't know that I recommend this as a solution, but I think it makes some useful points about using --token to represent the data in a somewhat unexpected way.

I hope this all helps!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse