Hi Ted,

I've modified the stop word list so that each regex ends in \s/ instead of \b/,
but the problem is not solved: my bigram list now contains interesting bigrams
such as

in-band<>signalling
in-station<>modem

but also new, uninteresting bigrams such as

in Recommendation
defined in
shown in
described in
given in

Is it possible to get only bigrams like

in-band<>signalling
in-station<>modem

and not the other, uninteresting bigrams?

Thanks for your help,

Mercè
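
A note on why the \s variant leaves "in late" in the list: the count.pl output quoted below still reports 6 bigrams, which suggests that each stop regex is tested against one token at a time, and a lone token "in" contains no whitespace for \s to match. Under that assumption (it is not confirmed anywhere in this thread), anchoring the regex to the whole token, for example /^[iI]n$/, would stop the word "in" while leaving "in-band" alone. The Python sketch below only imitates that behaviour with the re module. It is not NSP code, and tokenize, bigrams and surviving are hypothetical helper names.

```python
import re

# Same two patterns, in the same order, as the token.txt quoted in the thread.
TOKEN_PATTERNS = [r"\w+-\w+", r"\w+"]


def tokenize(text):
    """Split text into tokens, trying the patterns in the order listed."""
    combined = re.compile("|".join(f"(?:{p})" for p in TOKEN_PATTERNS))
    return combined.findall(text)


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


def surviving(pairs, stop_patterns):
    """Imitate @stop.mode=OR: drop a bigram if ANY of its words matches.

    Assumption: every stop regex is searched inside a single token,
    never inside the original line of text.
    """
    kept = []
    for first, second in pairs:
        stopped = any(
            re.search(pattern, word)
            for word in (first, second)
            for pattern in stop_patterns
        )
        if not stopped:
            kept.append(f"{first}<>{second}")
    return kept


tokens = tokenize("i like in-line skating in late june.")
pairs = bigrams(tokens)

# \b also sits next to the hyphen, so "in-line" is stopped along with "in".
print(surviving(pairs, [r"\bin\b"]))
# \s can never match inside a token, so nothing is stopped at all.
print(surviving(pairs, [r"\b[iI]n\s"]))
# ^ and $ tie the match to the whole token, so only the word "in" is stopped.
print(surviving(pairs, [r"^[iI]n$"]))
```

If the assumption holds, the matching stop.txt entry would be /^[iI]n$/ in place of /\b[iI]n\s/, which keeps in-line<>skating and drops skating<>in and in<>late.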




--- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
>
> Hi Merce,
> 
> Yes, indeed, you can do as you describe. This gets into some important
> details about regular expressions that I'm happy to have a chance to
> mention.  In the default stoplist the stop words are delimited by \b, as in
> 
> /\bin\b/
> 
> This means match "in" as a stop word when it has a word boundary on each
> side. A word boundary falls next to spaces and most punctuation, including
> the hyphen, so this expression also matches the "in" of "in-line".
> 
> So, if you want to find bigrams like "in-line" but then exclude ones like
> "in the", then you need to adjust the stoplist so that the stop words are
> perhaps just surrounded by spaces.  I say perhaps since there are various
> ways to do this, but the simplest one is shown below...
> 
> ted@linux-zxku:~> more stop.txt
> @stop.mode=OR
> /\b[iI]n\s/
> 
> ted@linux-zxku:~> more token.txt
> /\w+-\w+/
> /\w+/
> 
> ted@linux-zxku:~> more test.txt
> i like in-line skating in late june.
> 
> ted@linux-zxku:~> count.pl output.txt test.txt --token token.txt --stop stop.txt
> 
> ted@linux-zxku:~> more output.txt
> 6
> in<>late<>1 1 1
> late<>june<>1 1 1
> skating<>in<>1 1 1
> in-line<>skating<>1 1 1
> i<>like<>1 1 1
> like<>in-line<>1 1 1
> 
> I hope this helps.
> 
> Enjoy,
> Ted
> 
> On Fri, Apr 22, 2011 at 11:41 AM, mercevg <mercevg@...> wrote:
> 
> >
> >
> > Ted,
> >
> > Thanks, I've added this regular expression to my token file and it works
> > well.
> >
> > One more comment about that:
> >
> > In my corpus I have some interesting bigrams such as
> > "in-band signalling"
> > "in-call rearrangement"
> > "in-slot signalling"
> >
> > If I filter "in" as a stop word, I can't get this kind of bigram from my
> > corpus. On the other hand, if "in" is not on my stop word list, I retrieve
> > these bigrams but I also get more uninteresting bigrams such as
> >
> > "in Recommendation"
> > "in Figure"
> > "in order"
> >
> > My question is: Can I keep the first group of bigrams and filter out the
> > second group at the same time?
> >
> > Thank you for your help,
> >
> > Mercè
> >
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@> wrote:
> > >
> > > Greetings Merce,
> > >
> > > This is fairly easy to handle via the --token option. You simply specify a
> > > regular expression that says a token is a string followed by a - followed
> > > by a string. You can customize a --token file in many ways, but the
> > > following example will handle hyphenated words. Please do let us know if
> > > additional questions arise!
> > >
> > > linux@linux:~> count.pl test.out test.txt --token token.txt
> > >
> > > linux@linux:~> more test.out
> > > 13
> > > cell-phone<>It<>1 1 1
> > > the<>village-shop<>1 1 1
> > > s<>extra-nice<>1 1 1
> > > village-shop<>today<>1 1 1
> > > bought<>a<>1 1 1
> > > went<>to<>1 1 1
> > > a<>cell-phone<>1 1 1
> > > i<>went<>1 1 1
> > > today<>and<>1 1 1
> > > It<>s<>1 1 1
> > > and<>I<>1 1 1
> > > I<>bought<>1 1 1
> > > to<>the<>1 1 1
> > >
> > > linux@linux:~> cat test.txt
> > > i went to the village-shop today, and I bought a cell-phone. It's
> > > extra-nice.
> > >
> > > linux@linux:~> cat token.txt
> > > /\w+\-\w+/
> > > /\w+/
> > >
> > > Enjoy,
> > > Ted
> > >
> > > On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@> wrote:
> > >
> > > >
> > > >
> > > > Dear all,
> > > >
> > > > I would like to know if it's possible to get a list of ngrams with a
> > > > hyphen inside, maybe during the tokenization process.
> > > >
> > > > For example, I want to get these bigrams:
> > > > - call-connected signal
> > > > - clear-back signal
> > > > - clear-forward signal
> > > >
> > > > Instead of two bigrams for each one:
> > > > - call<>connected<>179 2608 527
> > > > connected<>signal<>189 320 9176
> > > >
> > > > - clear<>back<>283 1115 733
> > > > back<>signal<>157 380 9176
> > > >
> > > > - clear<>forward<>632 1115 877
> > > > forward<>signal<>493 1547 9176
> > > >
> > > > Thanks a lot,
> > > >
> > > > Mercè
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Pedersen
> > > http://www.d.umn.edu/~tpederse
> > >
> >
> >  
> >
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
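
The order of the two lines in the token.txt quoted in the thread (/\w+\-\w+/ before /\w+/) appears to matter: the hyphenated pattern has to be tried first, otherwise the plain-word pattern claims "village" before the hyphen is ever seen. The Python sketch below imitates that with re alternation, on the assumption that the tokenizer tries the patterns in file order at each position. It is an illustration, not NSP code, and tokenize is a hypothetical helper name.

```python
import re


def tokenize(text, token_patterns):
    """Return the tokens found in text.

    The patterns are joined into one alternation, so at every position the
    earliest pattern in the list that matches there is the one that is used.
    Characters that no pattern matches (spaces, commas, ...) are skipped.
    """
    combined = re.compile("|".join(f"(?:{p})" for p in token_patterns))
    return combined.findall(text)


text = ("i went to the village-shop today, and I bought a cell-phone. "
        "It's extra-nice.")

# Hyphenated pattern first, as in the quoted token.txt.
hyphen_first = tokenize(text, [r"\w+-\w+", r"\w+"])
# Plain-word pattern first: every hyphenated word is split in two.
word_first = tokenize(text, [r"\w+", r"\w+-\w+"])

print(hyphen_first)
print(len(hyphen_first) - 1)  # adjacent pairs, i.e. the number of bigrams
print(word_first)
```

With the hyphenated pattern first, the sketch yields 14 tokens and therefore 13 bigrams, which agrees with the total of 13 on the first line of the quoted test.out.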

