Ted,

Thanks, I've added this regular expression to my token file and it works well.

One more comment about that: in my corpus I have some interesting bigrams such as

"in-band signalling"
"in-call rearrangement"
"in-slot signalling"

If I filter "in" as a stopword, I can't get these kinds of bigrams from my corpus. On the other hand, if "in" is not on my stopword list, I retrieve these bigrams but I also get more uninteresting bigrams such as

"in Recommendation"
"in Figure"
"in order"

My question is: can I filter out the second group and still retrieve the first group in the same run?

Thank you for your help,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
>
> Greetings Merce,
>
> This is fairly easy to handle via the --token option. You simply specify a
> regular expression that says a token is a string followed by a - followed by
> a string. You can customize a --token file many ways, but the following
> example will handle hyphenated words. Please do let us know if additional
> questions arise!
>
> linux@linux:~> count.pl test.out test.txt --token token.txt
>
> linux@linux:~> more test.out
> 13
> cell-phone<>It<>1 1 1
> the<>village-shop<>1 1 1
> s<>extra-nice<>1 1 1
> village-shop<>today<>1 1 1
> bought<>a<>1 1 1
> went<>to<>1 1 1
> a<>cell-phone<>1 1 1
> i<>went<>1 1 1
> today<>and<>1 1 1
> It<>s<>1 1 1
> and<>I<>1 1 1
> I<>bought<>1 1 1
> to<>the<>1 1 1
>
> linux@linux:~> cat test.txt
> i went to the village-shop today, and I bought a cell-phone. It's
> extra-nice.
>
> linux@linux:~> cat token.txt
> /\w+\-\w+/
> /\w+/
>
> Enjoy,
> Ted
>
> On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@...> wrote:
>
> > Dear all,
> >
> > I would like to know if it's possible to get a list of ngrams with a hyphen
> > inside, maybe during the tokenization process.
> >
> > For example, I want to get these bigrams:
> > - call-connected signal
> > - clear-back signal
> > - clear-forward signal
> >
> > Instead of two bigrams for each one:
> > - call<>connected<>179 2608 527
> > connected<>signal<>189 320 9176
> >
> > - clear<>back<>283 1115 733
> > back<>signal<>157 380 9176
> >
> > - clear<>forward<>632 1115 877
> > forward<>signal<>493 1547 9176
> >
> > Thanks a lot,
> >
> > Mercè
> >
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
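
For anyone who wants to experiment outside count.pl, here is a rough Python sketch of the tokenization that the two token.txt rules describe: the hyphenated pattern is tried first, then plain words. It is only an illustration, not NSP code. The helper names (tokenize, bigrams, drop_stopword_bigrams) are made up for this example, and the stopword filter is a simplified stand-in that does not claim to reproduce how count.pl handles stopwords.

```python
import re

# Mirrors the order of the rules in token.txt: hyphenated words first, then plain words.
TOKEN_RE = re.compile(r"\w+-\w+|\w+")


def tokenize(text):
    """Split text into tokens, keeping words like 'in-band' as one token."""
    return TOKEN_RE.findall(text)


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


def drop_stopword_bigrams(pairs, stopwords):
    """Hypothetical filter: discard any bigram containing a stopword as a whole token."""
    return [p for p in pairs if not any(tok.lower() in stopwords for tok in p)]


# The test.txt content from the quoted message.
text = "i went to the village-shop today, and I bought a cell-phone. It's\nextra-nice."
pairs = bigrams(tokenize(text))
print(len(pairs))  # 13 bigrams, the same total reported in test.out

# Made-up sentence: 'in-band' is a single token, so the stopword 'in' does not match it.
sample = "in-band signalling is described in Figure 2"
kept = drop_stopword_bigrams(bigrams(tokenize(sample)), {"in"})
print(kept)
```

In this sketch "in-band signalling" survives because the stopword is compared against whole tokens, while "in Figure" is dropped. Whether the same happens in a real run depends on how the stop list is written and applied there, so please treat it as something to check rather than a guarantee.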