Greetings Merce,

This is fairly easy to handle via the --token option. You simply specify a
regular expression that says a token is a string followed by a - followed by
another string. You can customize a --token file in many ways, but the
following example will handle hyphenated words. Please do let us know if
additional questions arise!

linux@linux:~> count.pl test.out test.txt --token token.txt

linux@linux:~> more test.out
13
cell-phone<>It<>1 1 1
the<>village-shop<>1 1 1
s<>extra-nice<>1 1 1
village-shop<>today<>1 1 1
bought<>a<>1 1 1
went<>to<>1 1 1
a<>cell-phone<>1 1 1
i<>went<>1 1 1
today<>and<>1 1 1
It<>s<>1 1 1
and<>I<>1 1 1
I<>bought<>1 1 1
to<>the<>1 1 1

linux@linux:~> cat test.txt
i went to the village-shop today, and I bought a cell-phone. It's
extra-nice.

linux@linux:~> cat token.txt
/\w+\-\w+/
/\w+/
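
If it helps to see why these two patterns keep hyphenated words together,
here is a rough Python sketch. It is not how count.pl is implemented; it
only assumes that the patterns are tried in the order they are listed and
that any character matching neither pattern is skipped. The names
TOKEN_PATTERNS, tokenize and bigrams are made up for this illustration.

```python
import re

# Patterns copied from token.txt, in the same order (hyphenated form first).
TOKEN_PATTERNS = [r"\w+\-\w+", r"\w+"]


def tokenize(text, patterns=TOKEN_PATTERNS):
    """Return the tokens found in text, trying patterns left to right."""
    # Assumption: an ordered alternation approximates the token file.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    # Characters that match no pattern (spaces, commas, ') are skipped.
    return combined.findall(text)


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


text = ("i went to the village-shop today, and I bought a cell-phone. It's\n"
        "extra-nice.")
tokens = tokenize(text)
pairs = bigrams(tokens)

for left, right in pairs:
    print(f"{left}<>{right}")
```

In this sketch the order of the patterns matters: with \w+ listed first,
"village-shop" would come out as the two tokens "village" and "shop".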

Enjoy,
Ted

On Wed, Apr 20, 2011 at 2:20 PM, mercevg <merc...@yahoo.es> wrote:

>
>
> Dear all,
>
> I would like to know if it's possible to get a list of ngrams with a hyphen
> inside, maybe during the tokenization process.
>
> For example, I want to get these bigrams:
> - call-connected signal
> - clear-back signal
> - clear-forward signal
>
> Instead of two bigrams for each one:
> - call<>connected<>179 2608 527
> connected<>signal<>189 320 9176
>
> - clear<>back<>283 1115 733
> back<>signal<>157 380 9176
>
> - clear<>forward<>632 1115 877
> forward<>signal<>493 1547 9176
>
> Thanks a lot,
>
> Mercè
>
>  
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
