Greetings Merce,

This is fairly easy to handle via the --token option. You simply specify a regular expression that says a token is a string followed by a - followed by a string. You can customize a --token file in many ways, but the following example will handle hyphenated words. Please do let us know if additional questions arise!

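
For readers who want to see why the hyphenated pattern keeps such words together, here is a rough Python sketch of the idea. It is not how count.pl is implemented. It assumes the token definitions are tried in the order listed and that characters matching none of them are skipped. The names tokenize, bigrams and TOKEN_PATTERNS are made up for illustration.

```python
import re
from collections import Counter

# Same two definitions as in token.txt, in the same order: the hyphenated
# pattern is listed first so it is tried before the plain-word pattern.
# (Assumption: this mirrors the order in which the definitions are applied.)
TOKEN_PATTERNS = [r"\w+-\w+", r"\w+"]

def tokenize(text, patterns=TOKEN_PATTERNS):
    # Alternation tries the patterns left to right at each position;
    # characters that match no pattern (spaces, commas, ...) are skipped.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return combined.findall(text)

def bigrams(tokens):
    # Count each pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

text = "i went to the village-shop today, and I bought a cell-phone. It's extra-nice."
tokens = tokenize(text)
counts = bigrams(tokens)

# Print in a layout loosely modelled on the bigram list (pair and count only).
for (left, right), n in counts.items():
    print(f"{left}<>{right}<>{n}")
```

With only the plain-word pattern, tokenize("clear-back signal", [r"\w+"]) splits the hyphenated word into clear and back, which is the behaviour the original question describes.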
linux@linux:~> count.pl test.out test.txt --token token.txt
linux@linux:~> more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1
linux@linux:~> cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's extra-nice.
linux@linux:~> cat token.txt
/\w+\-\w+/
/\w+/

Enjoy,
Ted

On Wed, Apr 20, 2011 at 2:20 PM, mercevg <merc...@yahoo.es> wrote:
> Dear all,
>
> I would like to know if it's possible to get a list of ngrams with a hyphen
> inside, maybe during the tokenization process.
>
> For example, I want to get these bigrams:
> - call-connected signal
> - clear-back signal
> - clear-forward signal
>
> Instead of two bigrams for each one:
> - call<>connected<>179 2608 527
> connected<>signal<>189 320 9176
>
> - clear<>back<>283 1115 733
> back<>signal<>157 380 9176
>
> - clear<>forward<>632 1115 877
> forward<>signal<>493 1547 9176
>
> Thanks a lot,
>
> Mercè

--
Ted Pedersen
http://www.d.umn.edu/~tpederse