Hi Merce,

Yes, indeed, you can do as you describe. This gets into some important
details about regular expressions that I'm happy to have a chance to
mention. In the default stoplist the stop words are delimited by \b, as in

/\bin\b/

This means match "in" as a stop word when it is surrounded by word
boundaries. A word boundary is the position between a word character (a
letter, digit, or underscore) and anything else, so it occurs next to
spaces and also next to various punctuation, including the -. So, if you
want to find bigrams like "in-line" but exclude ones like "in the", then
you need to adjust the stoplist so that the stop words are perhaps just
surrounded by spaces. I say perhaps since there are various ways to do
this, but the simplest one is shown below...

ted@linux-zxku:~> more stop.txt
@stop.mode=OR
/\b[iI]n\s/

ted@linux-zxku:~> more token.txt
/\w+-\w+/
/\w+/

ted@linux-zxku:~> more test.txt
i like in-line skating in late june.

ted@linux-zxku:~> count.pl output.txt test.txt --token token.txt --stop stop.txt

ted@linux-zxku:~> more output.txt
6
in<>late<>1 1 1
late<>june<>1 1 1
skating<>in<>1 1 1
in-line<>skating<>1 1 1
i<>like<>1 1 1
like<>in-line<>1 1 1

I hope this helps.

Enjoy,
Ted

On Fri, Apr 22, 2011 at 11:41 AM, mercevg <merc...@yahoo.es> wrote:
>
> Ted,
>
> Thanks, I've added this regular expression to my tokens file and it
> works well.
>
> One more comment about that:
>
> In my corpus I have some interesting bigrams such as
> "in-band signalling"
> "in-call rearrangement"
> "in-slot signalling"
>
> If I filter "in" as a stopword, I can't get these kinds of bigrams from
> my corpus. On the other hand, if "in" is not in my stopword list, I
> retrieve these bigrams but I also get more bigrams of no interest, such as
>
> "in Recommendation"
> "in Figure"
> "in order"
>
> My question is: can I filter out the second group and still retrieve the
> first group at the same time?
>
> Thank you for your help,
>
> Mercè
>
> --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
> >
> > Greetings Merce,
> >
> > This is fairly easy to handle via the --token option. You simply
> > specify a regular expression that says a token is a string followed by
> > a - followed by a string. You can customize a --token file many ways,
> > but the following example will handle hyphenated words.
> > Please do let us know if additional questions arise!
> >
> > linux@linux:~> count.pl test.out test.txt --token token.txt
> >
> > linux@linux:~> more test.out
> > 13
> > cell-phone<>It<>1 1 1
> > the<>village-shop<>1 1 1
> > s<>extra-nice<>1 1 1
> > village-shop<>today<>1 1 1
> > bought<>a<>1 1 1
> > went<>to<>1 1 1
> > a<>cell-phone<>1 1 1
> > i<>went<>1 1 1
> > today<>and<>1 1 1
> > It<>s<>1 1 1
> > and<>I<>1 1 1
> > I<>bought<>1 1 1
> > to<>the<>1 1 1
> >
> > linux@linux:~> cat test.txt
> > i went to the village-shop today, and I bought a cell-phone. It's
> > extra-nice.
> >
> > linux@linux:~> cat token.txt
> > /\w+\-\w+/
> > /\w+/
> >
> > Enjoy,
> > Ted
> >
> > On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@...> wrote:
> > >
> > > Dear all,
> > >
> > > I would like to know if it's possible to get a list of ngrams with a
> > > hyphen inside, maybe during the tokenization process.
> > >
> > > For example, I want to get these bigrams:
> > > - call-connected signal
> > > - clear-back signal
> > > - clear-forward signal
> > >
> > > Instead of two bigrams for each one:
> > > - call<>connected<>179 2608 527
> > > connected<>signal<>189 320 9176
> > >
> > > - clear<>back<>283 1115 733
> > > back<>signal<>157 380 9176
> > >
> > > - clear<>forward<>632 1115 877
> > > forward<>signal<>493 1547 9176
> > >
> > > Thanks a lot,
> > >
> > > Mercè
> > >
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
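
For anyone who wants to check the word-boundary behaviour discussed at the top of this thread outside of NSP, here is a minimal sketch in Python. It is an illustration only: it uses Python's re module rather than Perl, the variable names are made up for the example, and it does not try to reproduce how count.pl applies a stop list. It only shows how \b and \s behave next to a hyphen, and how trying the hyphenated pattern before the plain one splits the test sentence into tokens.

```python
import re

# Pattern in the style of the default stop list: "in" between word boundaries.
default_stop = re.compile(r"\bin\b")
# Adjusted pattern from the reply: "in" or "In" followed by whitespace.
adjusted_stop = re.compile(r"\b[iI]n\s")

# Sample strings taken from the thread.
samples = ["in-line skating", "in late june", "in Figure", "in-band signalling"]

results = {}
for text in samples:
    results[text] = (
        bool(default_stop.search(text)),   # does \bin\b match somewhere?
        bool(adjusted_stop.search(text)),  # does \b[iI]n\s match somewhere?
    )
    print(f"{text!r}: default={results[text][0]} adjusted={results[text][1]}")

# The two token patterns from token.txt, tried in the same order:
# a hyphenated word first, then a plain word.
token_re = re.compile(r"\w+-\w+|\w+")
tokens = token_re.findall("i like in-line skating in late june.")
print(tokens)
```

The \bin\b pattern matches in all four samples, because the position between "n" and "-" is a word boundary. The \b[iI]n\s pattern matches only where "in" is followed by whitespace, so the hyphenated forms are left alone. The tokenizer sketch yields seven tokens, which is consistent with the six bigrams reported in output.txt.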