Greetings all,

On Mon, Feb 23, 2009 at 8:11 AM, robsteranium <robsteran...@yahoo.co.uk> wrote:
> I would run count.pl without a stop list (or perhaps just a normal
> stop list) and then process the output in another program (e.g. sed).
> This one liner would do the trick:
>
> sed -ne '/XTOPOX/p' count-output.cnt
>

I think this is a great suggestion. One could also use grep or egrep
from the command line, but sed is quite powerful and useful to know
about.
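If you'd rather stay out of the shell, the same filtering is easy to
sketch in a few lines of Python (this is just an illustration of the
idea, not anything NSP-specific -- the function name is mine):

```python
def filter_xtopox(lines):
    """Keep only the count.pl output lines that mention XTOPOX,
    the same effect as the sed/grep one-liners above."""
    return [line for line in lines if "XTOPOX" in line]

# For example, applied to a couple of lines of count.pl output:
sample = ["to XTOPOX<>1", "and then<>1"]
kept = filter_xtopox(sample)   # -> ["to XTOPOX<>1"]
```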

I'm not sure that I recommend the following, but it does make some
useful points about using regular expressions in NSP, so I thought I
would pass it along.

This is my input:

marimba(51): cat testin
I went to XTOPOX and then left XTOPOX.
I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
Where is XTOPOX XTOPOX is in Namibia

I would like to find all the bigrams that include XTOPOX. Another
option would be to define the tokens such that they are two words long
and must include XTOPOX as either the first or second word. Note that
we will need to treat tokens of this form as unigrams in order to get
the desired result.

I define my tokens as follows:

marimba(52): cat token.txt
/\b\w+\b\s+\bXTOPOX\b/
/\bXTOPOX\b\s+\b\w+\b/

And then run count.pl using that as the tokenization file...

marimba(54): count.pl --ngram 1 testout testin --token token.txt

My output is as follows...

marimba(55): cat testout
6
like XTOPOX<>2
XTOPOX is<>1
is XTOPOX<>1
left XTOPOX<>1
to XTOPOX<>1

Note that we are getting counts here of how many times each of these
two-word strings has occurred, and that the two-word strings are
indivisible (they are the tokens, the atomic units of our counting...)
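If it helps to see the mechanics, here is a small Python sketch of
what the tokenizer appears to be doing: at each position in the text
it tries the regexes from token.txt in order and consumes the first
one that matches, otherwise it moves ahead one character. This is my
reading of count.pl's behavior based on the output above, not a quote
from its source code, but it reproduces the counts exactly:

```python
import re
from collections import Counter

# The two token definitions from token.txt, tried in this order.
token_res = [re.compile(r'\b\w+\b\s+\bXTOPOX\b'),
             re.compile(r'\bXTOPOX\b\s+\b\w+\b')]

def tokenize(text):
    """Consume the first token regex that matches at each position;
    if none matches, skip ahead one character."""
    tokens, pos = [], 0
    while pos < len(text):
        for rx in token_res:
            m = rx.match(text, pos)
            if m:
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1
    return tokens

counts = Counter()
for line in ["I went to XTOPOX and then left XTOPOX.",
             "I like XTOPOX pretty well, but I don't think you like XTOPOX at all.",
             "Where is XTOPOX XTOPOX is in Namibia"]:
    counts.update(tokenize(line))
# counts now holds like XTOPOX: 2, and to XTOPOX, left XTOPOX,
# is XTOPOX, XTOPOX is: 1 each -- six tokens in all, as above.
```

Notice how "is XTOPOX" wins over "XTOPOX is" for the first XTOPOX in
the third line, because the first regex is tried first; the second
XTOPOX can then only match the second regex.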

If you were to run this without --ngram 1 you would get --ngram 2 by
default, and the results would look rather odd...

marimba(111): count.pl --ngram 2 test2out testin --token token.txt

marimba(112): more test2out
5
is XTOPOX<>XTOPOX is<>1 1 1
like XTOPOX<>like XTOPOX<>1 2 2
left XTOPOX<>like XTOPOX<>1 1 2
to XTOPOX<>left XTOPOX<>1 1 1
like XTOPOX<>is XTOPOX<>1 2 1

However, it makes sense in that it's giving bigram counts for these
two-word unigrams (separated by the <>).
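As I read the output, the three numbers on each line are the joint
frequency of the bigram, the frequency of the left token in first
position, and the frequency of the right token in second position,
with bigrams allowed to cross line boundaries (that is the only way
the totals above work out). A short Python sketch of that reading,
using the six tokens found above:

```python
from collections import Counter

# The six two-word tokens, in the order they occur in testin.
tokens = ["to XTOPOX", "left XTOPOX", "like XTOPOX",
          "like XTOPOX", "is XTOPOX", "XTOPOX is"]

bigrams = list(zip(tokens, tokens[1:]))          # 5 adjacent pairs
joint = Counter(bigrams)                         # bigram frequency
left  = Counter(b[0] for b in bigrams)           # token in position 1
right = Counter(b[1] for b in bigrams)           # token in position 2

def as_line(b):
    """Format one bigram the way count.pl prints it (my reading of
    the three trailing numbers, not a quote from the NSP docs)."""
    return "%s<>%s<>%d %d %d" % (b[0], b[1],
                                 joint[b], left[b[0]], right[b[1]])

# as_line(("like XTOPOX", "like XTOPOX")) reproduces
# "like XTOPOX<>like XTOPOX<>1 2 2" from the output above.
```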

Again, I don't know that I recommend this as a solution, but I think
it makes some useful points about using --token to represent the data
in a somewhat different way than you might expect.

I hope this all helps!

Cordially,
Ted
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
