Dear all,

Thanks to everyone for the fast help and the very comprehensive input. I ended up keeping it simple and wrote a Perl script to filter out the XTOPOX ngrams on my own.
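The script itself is not included in this message. As a minimal sketch of that kind of filter (written in Python rather than Perl), the following assumes that the goal is to keep the ngrams containing XTOPOX, and that count.pl writes the total on the first line followed by one ngram per line with fields separated by <>. The function name keep_ngrams_with and the sample lines are made up for illustration.

```python
def keep_ngrams_with(lines, word="XTOPOX"):
    """Return the n-gram lines in which `word` appears as a whole token."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("<>")
        # The last <>-separated field holds the counts, so only the
        # earlier fields are tokens. The leading total line has a single
        # field and is therefore never kept.
        if word in fields[:-1]:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up sample in count.pl's bigram layout (illustrative counts only).
sample = [
    "3",
    "to<>XTOPOX<>1 1 2",
    "went<>to<>1 1 1",
    "XTOPOX<>is<>1 1 1",
]

for line in keep_ngrams_with(sample):
    print(line)
```

Unlike a plain substring match such as sed -ne '/XTOPOX/p', this compares whole tokens, so a longer token that merely contains XTOPOX would not be kept.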
Thanks again,
Florian

--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Greetings all,
>
> On Mon, Feb 23, 2009 at 8:11 AM, robsteranium <robsteran...@...> wrote:
> > I would run count.pl without a stop list (or perhaps just a normal
> > stop list) and then process the output in another program (e.g. sed).
> > This one liner would do the trick:
> >
> > sed -ne '/XTOPOX/p' count-output.cnt
> >
>
> I think this is a great suggestion. One could also use grep or egrep
> from the command line too, but sed is quite powerful and useful to
> know about.
>
> I don't know if I recommend the following, but it does make some
> useful points about using regular expressions in NSP, so I thought I
> would pass it along.
>
> This is my input:
>
> marimba(51): cat testin
> I went to XTOPOX and then left XTOPOX.
> I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
> Where is XTOPOX XTOPOX is in Namibia
>
> I would like to find all the bigrams that include XTOPOX. Another
> option would be to define the tokens such that they are two words long
> and must include XTOPOX as either the first or second word. Note that
> we will need to treat tokens of this form as unigrams in order to get
> the desired result.
>
> I define my tokens as follows:
>
> marimba(52): cat token.txt
> /\b\w+\b\s+\bXTOPOX\b/
> /\bXTOPOX\b\s+\b\w+\b/
>
> And then run count.pl using that as the tokenization file....
>
> marimba(54): count.pl --ngram 1 testout testin --token token.txt
>
> My output is as follows...
>
> marimba(55): cat testout
> 6
> like XTOPOX<>2
> XTOPOX is<>1
> is XTOPOX<>1
> left XTOPOX<>1
> to XTOPOX<>1
>
> Note that we are getting counts here of how many times this two word
> string has occurred, and that these two word strings are indivisible
> (these are the tokens, the atomic units of our counting...)
>
> If you were to run this without using --ngram 1 you would get --ngram
> 2 by default, and the results will look rather odd...
>
> marimba(111): count.pl --ngram 2 test2out testin --token token.txt
>
> marimba(112): more test2out
> 5
> is XTOPOX<>XTOPOX is<>1 1 1
> like XTOPOX<>like XTOPOX<>1 2 2
> left XTOPOX<>like XTOPOX<>1 1 2
> to XTOPOX<>left XTOPOX<>1 1 1
> like XTOPOX<>is XTOPOX<>1 2 1
>
> However, it makes sense in that it's giving bigram counts for these
> two word unigrams (separated by the <>).
>
> Again, I don't know that I recommend this as a solution, but I think
> it makes some useful points about using --token to represent the data
> in a somewhat different than expected way.
>
> I hope this all helps!
>
> Cordially,
> Ted
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
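
The token-file approach quoted above can be checked outside NSP with a rough emulation. The sketch below assumes that count.pl tokenizes each line by trying the regexes from token.txt in order at the current position and skipping one character when none of them matches. That scanning model, and the names TOKEN_PATTERNS and tokenize, are illustrative assumptions and not NSP code.

```python
import re
from collections import Counter

# The two token definitions from token.txt, tried in this order.
TOKEN_PATTERNS = [
    re.compile(r"\b\w+\b\s+\bXTOPOX\b"),
    re.compile(r"\bXTOPOX\b\s+\b\w+\b"),
]


def tokenize(line):
    """Scan left to right, taking the first pattern that matches at the
    current position; if none matches, skip one character (assumed model)."""
    tokens, pos = [], 0
    while pos < len(line):
        for pattern in TOKEN_PATTERNS:
            match = pattern.match(line, pos)
            if match:
                tokens.append(match.group())
                pos = match.end()
                break
        else:
            # No token definition matched here: treat the character as
            # non-token material and move on.
            pos += 1
    return tokens


# The three input lines from the quoted testin file.
text = """I went to XTOPOX and then left XTOPOX.
I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
Where is XTOPOX XTOPOX is in Namibia"""

tokens = [t for line in text.splitlines() for t in tokenize(line)]

# --ngram 1: each two-word token is counted as an indivisible unit.
unigrams = Counter(tokens)
# --ngram 2: pairs of adjacent two-word tokens.
bigrams = Counter(zip(tokens, tokens[1:]))

print(len(tokens))
for token, n in unigrams.most_common():
    print(f"{token}<>{n}")
print(sum(bigrams.values()))
for (first, second), n in bigrams.items():
    print(f"{first}<>{second}<>{n}")
```

On the quoted input this yields six two-word tokens and five adjacent pairs, in line with the totals of 6 and 5 in the quoted count.pl output. The marginal counts that count.pl prints after each bigram are not computed here.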