Hi Merce,

Yes, indeed, you can do as you describe. This gets into some important
details about regular expressions that I'm happy to have a chance to
mention.  In the default stoplist the stop words are delimited by \b, as in

/\bin\b/

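This means match "in" as a stop word when it has a word boundary on each
side. A word boundary (\b) is the position between a word character (a
letter, digit or underscore) and anything else, so it occurs next to spaces
and also next to most punctuation, including the -. That is why /\bin\b/
also matches the "in" of "in-line".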
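
To see the word boundary behaviour outside of NSP, here is a minimal sketch
in Python rather than Perl (an assumption made for illustration; for this
plain ASCII text Python's re module treats \b the same way). It only
exercises the regular expression on a few strings and says nothing about how
count.pl itself applies a stoplist.

```python
import re

# The default-style stop regex from above, written for Python's re module.
stop_in = re.compile(r"\bin\b")

# "in the" and "in-line" both have a word boundary on each side of "in";
# "inline" and "login" do not, because "in" touches another letter.
for text in ["in the", "in-line", "inline", "login"]:
    hit = stop_in.search(text) is not None
    print(f"{text!r}: {'match' if hit else 'no match'}")
```

The point to notice is that "in-line" matches just like "in the" does, since
the - counts as a non-word character.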

So, if you want to keep bigrams that contain tokens like "in-line" but
exclude ones like "in the", then you need to adjust the stoplist so that the
stop words are perhaps just surrounded by spaces.  I say perhaps since there
are various ways to do this, but the simplest one is shown below...

ted@linux-zxku:~> more stop.txt
@stop.mode=OR
/\b[iI]n\s/

ted@linux-zxku:~> more token.txt
/\w+-\w+/
/\w+/

ted@linux-zxku:~> more test.txt
i like in-line skating in late june.

ted@linux-zxku:~> count.pl output.txt test.txt --token token.txt --stop stop.txt

ted@linux-zxku:~> more output.txt
6
in<>late<>1 1 1
late<>june<>1 1 1
skating<>in<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1
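
As a rough cross-check, the same two regular expressions can be tried in
Python. This is a sketch under stated assumptions, not a reimplementation of
count.pl: it assumes the token definitions are tried in the order listed
(modelled here with alternation), and it applies the stop regex to raw text
only, so it does not show how count.pl matches the stoplist against
individual tokens. The names token_re and stop_re are made up for the
example.

```python
import re

# Token definitions from token.txt, hyphenated form first so that
# "in-line" is kept whole instead of being split into "in" and "line".
token_re = re.compile(r"\w+-\w+|\w+")

# The stop regex from stop.txt: "in" or "In" followed by whitespace.
stop_re = re.compile(r"\b[iI]n\s")

text = "i like in-line skating in late june."

tokens = token_re.findall(text)
bigrams = list(zip(tokens, tokens[1:]))
print(tokens)
print(len(bigrams))

# On raw text the stop regex matches "in late" (space after "in")
# but not "in-line skating" (hyphen after "in").
print(bool(stop_re.search("in late")))
print(bool(stop_re.search("in-line skating")))
```

The tokenizer yields seven tokens and therefore six adjacent pairs, which
agrees with the total of 6 on the first line of output.txt.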

I hope this helps.

Enjoy,
Ted

On Fri, Apr 22, 2011 at 11:41 AM, mercevg <merc...@yahoo.es> wrote:

>
>
> Ted,
>
> Thanks, I've added this regular expression to my tokens file and it works
> well.
>
> One more comment about that:
>
> In my corpus I have some interesting bigrams such as
> "in-band signalling"
> "in-call rearrangement"
> "in-slot signalling"
>
> If I filter "in" as a stopword, I can't get these kinds of bigrams from my
> corpus. On the other hand, if "in" is not on my stopword list, I retrieve
> these bigrams but I also get uninteresting bigrams such as
>
> "in Recommendation"
> "in Figure"
> "in order"
>
> My question is: Can I filter and retrieve these two groups of bigrams at
> the same time?
>
> Thank you for your help,
>
> Mercè
>
>
> --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
> >
> > Greetings Merce,
> >
> > This is fairly easy to handle via the --token option. You simply specify a
> > regular expression that says a token is a string followed by a - followed
> > by a string. You can customize a --token file many ways, but the following
> > example will handle hyphenated words. Please do let us know if additional
> > questions arise!
> >
> > linux@linux:~> count.pl test.out test.txt --token token.txt
> >
> > linux@linux:~> more test.out
> > 13
> > cell-phone<>It<>1 1 1
> > the<>village-shop<>1 1 1
> > s<>extra-nice<>1 1 1
> > village-shop<>today<>1 1 1
> > bought<>a<>1 1 1
> > went<>to<>1 1 1
> > a<>cell-phone<>1 1 1
> > i<>went<>1 1 1
> > today<>and<>1 1 1
> > It<>s<>1 1 1
> > and<>I<>1 1 1
> > I<>bought<>1 1 1
> > to<>the<>1 1 1
> >
> > linux@linux:~> cat test.txt
> > i went to the village-shop today, and I bought a cell-phone. It's
> > extra-nice.
> >
> > linux@linux:~> cat token.txt
> > /\w+\-\w+/
> > /\w+/
> >
> > Enjoy,
> > Ted
> >
> > On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@...> wrote:
> >
> > >
> > >
> > > Dear all,
> > >
> > > I would like to know if it's possible to get a list of ngrams with a
> > > hyphen inside, maybe during the tokenization process.
> > >
> > > For example, I want to get these bigrams:
> > > - call-connected signal
> > > - clear-back signal
> > > - clear-forward signal
> > >
> > > Instead of two bigrams for each one:
> > > - call<>connected<>179 2608 527
> > > connected<>signal<>189 320 9176
> > >
> > > - clear<>back<>283 1115 733
> > > back<>signal<>157 380 9176
> > >
> > > - clear<>forward<>632 1115 877
> > > forward<>signal<>493 1547 9176
> > >
> > > Thanks a lot,
> > >
> > > Mercè
> > >
> > >
> > >
> >
> >
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
