Hi Ted,

I've modified the stopwords list using \s/ instead of \b/, but the problem
is not solved at all, because now in my bigrams list I get interesting
bigrams such as

in-band<>signalling
in-station<>modem

but also new bigrams of no interest, such as

in Recommendation
defined in
shown in
described in
given in

Is it possible to get just bigrams like

in-band<>signalling
in-station<>modem

and not the other new bigrams of no interest?

Thanks for your help,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
>
> Hi Merce,
>
> Yes, indeed, you can do as you describe. This gets into some important
> details about regular expressions that I'm happy to have a chance to
> mention. In the default stoplist the stop words are delimited by \b, as in
>
> /\bin\b/
>
> This means match "in" as a stop word when surrounded by word boundaries. A
> word boundary occurs at spaces as well as at various punctuation marks,
> including the -.
>
> So, if you want to find bigrams like "in-line" but then exclude ones like
> "in the", then you need to adjust the stoplist so that the stop words are
> perhaps just surrounded by spaces. I say perhaps since there are various
> ways to do this, but the simplest one is shown below...
>
> ted@linux-zxku:~> more stop.txt
> @stop.mode=OR
> /\b[iI]n\s/
>
> ted@linux-zxku:~> more token.txt
> /\w+-\w+/
> /\w+/
>
> ted@linux-zxku:~> more test.txt
> i like in-line skating in late june.
>
> ted@linux-zxku:~> count.pl output.txt test.txt --token token.txt --stop stop.txt
>
> ted@linux-zxku:~> more output.txt
> 6
> in<>late<>1 1 1
> late<>june<>1 1 1
> skating<>in<>1 1 1
> in-line<>skating<>1 1 1
> i<>like<>1 1 1
> like<>in-line<>1 1 1
>
> I hope this helps.
>
> Enjoy,
> Ted
>
> On Fri, Apr 22, 2011 at 11:41 AM, mercevg <mercevg@...> wrote:
>
> > Ted,
> >
> > Thanks, I've added this regular expression to my tokens file and it
> > works well.
> >
> > One more comment about that:
> >
> > In my corpus I have some interesting bigrams such as
> > "in-band signalling"
> > "in-call rearrangement"
> > "in-slot signalling"
> >
> > If I filter "in" as a stopword, I can't get these kinds of bigrams from
> > my corpus.
> > On the contrary, if "in" is not on my stopwords list, I retrieve these
> > bigrams, but I also get more bigrams of no interest, such as
> >
> > "in Recommendation"
> > "in Figure"
> > "in order"
> >
> > My question is: can I keep the first group of bigrams and filter out the
> > second group at the same time?
> >
> > Thank you for your help,
> >
> > Mercè
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@> wrote:
> > >
> > > Greetings Merce,
> > >
> > > This is fairly easy to handle via the --token option. You simply
> > > specify a regular expression that says a token is a string followed
> > > by a - followed by a string. You can customize a --token file many
> > > ways, but the following example will handle hyphenated words. Please
> > > do let us know if additional questions arise!
> > >
> > > linux@linux:~> count.pl test.out test.txt --token token.txt
> > >
> > > linux@linux:~> more test.out
> > > 13
> > > cell-phone<>It<>1 1 1
> > > the<>village-shop<>1 1 1
> > > s<>extra-nice<>1 1 1
> > > village-shop<>today<>1 1 1
> > > bought<>a<>1 1 1
> > > went<>to<>1 1 1
> > > a<>cell-phone<>1 1 1
> > > i<>went<>1 1 1
> > > today<>and<>1 1 1
> > > It<>s<>1 1 1
> > > and<>I<>1 1 1
> > > I<>bought<>1 1 1
> > > to<>the<>1 1 1
> > >
> > > linux@linux:~> cat test.txt
> > > i went to the village-shop today, and I bought a cell-phone. It's
> > > extra-nice.
> > >
> > > linux@linux:~> cat token.txt
> > > /\w+\-\w+/
> > > /\w+/
> > >
> > > Enjoy,
> > > Ted
> > >
> > > On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@> wrote:
> > >
> > > > Dear all,
> > > >
> > > > I would like to know if it's possible to get a list of ngrams with
> > > > a hyphen inside, maybe during the tokenization process.
> > > >
> > > > For example, I want to get these bigrams:
> > > > - call-connected signal
> > > > - clear-back signal
> > > > - clear-forward signal
> > > >
> > > > Instead of two bigrams for each one:
> > > > - call<>connected<>179 2608 527
> > > >   connected<>signal<>189 320 9176
> > > >
> > > > - clear<>back<>283 1115 733
> > > >   back<>signal<>157 380 9176
> > > >
> > > > - clear<>forward<>632 1115 877
> > > >   forward<>signal<>493 1547 9176
> > > >
> > > > Thanks a lot,
> > > >
> > > > Mercè
> > >
> > > --
> > > Ted Pedersen
> > > http://www.d.umn.edu/~tpederse
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
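
For readers following this thread, the difference between the stop patterns can be checked outside NSP. The Python sketch below is not count.pl. It assumes that each stop pattern is tested against one token at a time and that @stop.mode=OR drops a bigram when either token matches, which is consistent with in<>late surviving in Ted's output. The function names (tokenize, bigrams, filter_bigrams) and the anchored pattern ^[iI]n$ are illustrative and do not come from the thread.

```python
import re

# Token patterns from token.txt in the thread; the hyphenated form is tried first.
TOKEN_RE = re.compile(r"\w+-\w+|\w+")


def tokenize(text):
    """Split text into tokens, keeping hyphenated words such as 'in-line' whole."""
    return TOKEN_RE.findall(text)


def bigrams(tokens):
    """Return adjacent token pairs in order of appearance."""
    return list(zip(tokens, tokens[1:]))


def filter_bigrams(pairs, stop_pattern):
    """Assumed OR-mode filter: drop a pair if either token matches stop_pattern.

    The pattern is searched within each token separately (an assumption about
    how stop patterns are applied, not a statement about count.pl internals).
    """
    stop = re.compile(stop_pattern)
    return [pair for pair in pairs if not any(stop.search(tok) for tok in pair)]


text = "i like in-line skating in late june."
pairs = bigrams(tokenize(text))

# Three candidate stop patterns:
#   \b[iI]n\b  the default style; \b also matches next to '-', so 'in-line' is hit
#   \b[iI]n\s  the pattern from stop.txt; a bare token holds no whitespace
#   ^[iI]n$    hypothetical alternative anchored to the whole token
for pattern in (r"\b[iI]n\b", r"\b[iI]n\s", r"^[iI]n$"):
    kept = filter_bigrams(pairs, pattern)
    print(pattern, ["<>".join(p) for p in kept])
```

Under these assumptions, the \b pattern removes in-line<>skating along with the plain "in" bigrams, the \s pattern removes nothing, and the pattern anchored to the whole token removes only the bigrams containing the standalone word "in". Whether count.pl behaves the same way should be confirmed on a small test file before relying on it.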