Ted,

Thanks, I've added this regular expression to my tokens file and it works well.

One more comment about that:

In my corpus I have some interesting bigrams, such as:
"in-band signalling"
"in-call rearrangement"
"in-slot signalling"

If I filter "in" as a stopword, I can't get these kinds of bigrams from my 
corpus. On the other hand, if "in" is not on my stopword list, I retrieve 
these bigrams, but I also get more bigrams of no interest, such as:

"in Recommendation"
"in Figure"  
"in order"

My question is: can I keep the first group of bigrams and filter out the 
second group at the same time? 
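
To make the question concrete, here is a minimal Python sketch of the
behaviour I am hoping for. It is not NSP code and does not call count.pl; it
only assumes that hyphenated words are kept as single tokens (same pattern
order as the token.txt quoted below) and that a stopword is compared against
the whole token rather than a part of it. The names bigrams,
drop_stopword_bigrams, STOPWORDS and the sample sentence are made up for
illustration.

```python
import re

# Same order as token.txt: try hyphenated words first, then plain words.
TOKEN_RE = re.compile(r"\w+-\w+|\w+")

# Hypothetical stoplist; a token must equal a stopword exactly to match.
STOPWORDS = {"in"}


def bigrams(text):
    """Return adjacent token pairs, keeping hyphenated words as one token."""
    tokens = TOKEN_RE.findall(text)
    return list(zip(tokens, tokens[1:]))


def drop_stopword_bigrams(pairs, stopwords=STOPWORDS):
    """Drop a bigram if either of its tokens is exactly a stopword."""
    return [p for p in pairs if not any(t.lower() in stopwords for t in p)]


sample = "in-band signalling is shown in Figure 2"
kept = drop_stopword_bigrams(bigrams(sample))
print(kept)
```

With this whole-token comparison, "in-band signalling" survives because the
token "in-band" is not equal to "in", while "in Figure" is dropped. Whether
the stop list in count.pl can be made to behave this way is exactly what I
would like to know.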
 
Thank you for your help,

Mercè


--- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
>
> Greetings Merce,
> 
> This is fairly easy to handle via the --token option. You simply specify a
> regular expression that says a token is a string followed by a - followed by
> a string. You can customize a --token file many ways, but the following
> example will handle hyphenated words. Please do let us know if additional
> questions arise!
> 
> linux@linux:~> count.pl test.out test.txt --token token.txt
> 
> linux@linux:~> more test.out
> 13
> cell-phone<>It<>1 1 1
> the<>village-shop<>1 1 1
> s<>extra-nice<>1 1 1
> village-shop<>today<>1 1 1
> bought<>a<>1 1 1
> went<>to<>1 1 1
> a<>cell-phone<>1 1 1
> i<>went<>1 1 1
> today<>and<>1 1 1
> It<>s<>1 1 1
> and<>I<>1 1 1
> I<>bought<>1 1 1
> to<>the<>1 1 1
> 
> linux@linux:~> cat test.txt
> i went to the village-shop today, and I bought a cell-phone. It's
> extra-nice.
> 
> linux@linux:~> cat token.txt
> /\w+\-\w+/
> /\w+/
> 
> Enjoy,
> Ted
> 
> On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@...> wrote:
> 
> >
> >
> > Dear all,
> >
> > I would like to know if it's possible to get a list of ngrams with a hyphen
> > inside, maybe during the tokenization process.
> >
> > For example, I want to get these bigrams:
> > - call-connected signal
> > - clear-back signal
> > - clear-forward signal
> >
> > Instead of two bigrams for each one:
> > - call<>connected<>179 2608 527
> > connected<>signal<>189 320 9176
> >
> > - clear<>back<>283 1115 733
> > back<>signal<>157 380 9176
> >
> > - clear<>forward<>632 1115 877
> > forward<>signal<>493 1547 9176
> >
> > Thanks a lot,
> >
> > Mercè
> >
> >  
> >
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

