Hi Ted,

Thank you very much! I've redefined the \b boundary with this new regex, and now
my results list contains terminological bigrams with a hyphen, such as

in-band signalling
in-station modem

and no longer contains bigrams with a stopword, such as

in Recommendation
defined in
shown in
described in
given in

I think this new regex is very useful to get relevant unigrams, bigrams, etc. 
from a (specialised) corpus.
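
For the archive, here is a minimal Python sketch of the difference, not the NSP
code itself. It assumes that the token definitions are tried in the order listed
in token.txt, that each stop regex is tested against single tokens, and that OR
mode drops a bigram when either token matches. The function name bigrams is made
up for illustration.

```python
import re

# Token definitions from token.txt, tried in order at each position.
TOKEN = re.compile(r"\w+-\w+|\w+")

# Default-style stop regex: plain \b sees a boundary between "n" and "-".
PLAIN = re.compile(r"\b[iI]n\b")

# Ted's regex: the closing boundary treats "/" and "-" as word characters.
REDEFINED = re.compile(
    r"\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))"
)


def bigrams(text, stop):
    """Return adjacent token pairs, dropping pairs with a stop-word token (OR mode)."""
    tokens = TOKEN.findall(text)
    pairs = zip(tokens, tokens[1:])
    return [p for p in pairs if not any(stop.search(t) for t in p)]


text = "i like in-line skating in late june."
print(bigrams(text, PLAIN))      # "in-line" is treated as a stop word here
print(bigrams(text, REDEFINED))  # "in-line" survives, the bare "in" does not
```

With the redefined boundary the sketch keeps the same four bigrams that Ted's
count.pl run shows for this sentence.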

If I need some additional explanations, I'll let you know.

By the way, do you know if this regex could also be useful to get ngrams
containing the sequence "l·l"?
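
To make the question concrete, here is a small Python sketch, again not NSP
itself. The middle dot (U+00B7) is not a word character, so /\w+/ splits a word
at it. The extra token rule \w+·\w+ is my own guess, modelled on the hyphen
rule, and the sample phrase is invented. Whether count.pl reads the corpus as
Unicode is not checked here.

```python
import re

# The two rules from token.txt only.
WITHOUT_DOT = re.compile(r"\w+-\w+|\w+")

# Hypothetical extra rule: a word, a middle dot, a word (tried before plain \w+).
WITH_DOT = re.compile(r"\w+-\w+|\w+·\w+|\w+")

text = "una línia paral·lela"
print(WITHOUT_DOT.findall(text))  # the word is split at the middle dot
print(WITH_DOT.findall(text))     # the word stays in one piece
```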

Best wishes,
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <tpederse@...> wrote:
>
> Hi Merce,
> 
> Ah, yes, I see what you mean. The problem with using \s in the stoplist is
> that the tokenization prior to checking for stop words does not include a
> trailing \s, and so /\s[Ii]n\s/ is never "matched".
> 
> The trick here is to redefine the \b word boundary so that - is no longer
> treated as a boundary. This involves a bit of regular expression tampering
> which looks kind of awful but in fact works pretty nicely. What I have below
> is a regex (in a stoplist) that redefines \b so that - and / count as word
> characters.
> 
> @stop.mode=OR
> /\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/
> 
> So we have a word boundary \b
> followed by In or in
> followed by a word boundary that treats - and / as word characters
> 
> ted@linux-zxku:~> count.pl out test.txt --stop stop.txt --token token.txt
> 
> ted@linux-zxku:~> more out
> 4
> late<>june<>1 1 1
> in-line<>skating<>1 1 1
> i<>like<>1 1 1
> like<>in-line<>1 1 1
> 
> ted@linux-zxku:~> cat test.txt
> i like in-line skating in late june.
> 
> ted@linux-zxku:~> cat stop.txt
> @stop.mode=OR
> /\b[iI]n(?:(?<![\w/-])(?=[\w/-])|(?<=[\w/-])(?![\w/-]))/
> 
> It's important to say this regex came from Perl Monks,
> http://www.perlmonks.org/?node_id=308744
> 
> I hope this makes some sense, at least in a general way. I wouldn't worry
> too much about the regex itself, although if you need it modified in some
> way do let me know and we can work that out.
> 
> Enjoy,
> Ted
> 
> On Sat, Apr 23, 2011 at 4:51 PM, mercevg <mercevg@...> wrote:
> 
> >
> >
> > Hi Ted,
> >
> > I've modified the stopwords list using \s/ instead of \b/, but the problem
> > is not fully solved, because my bigrams list now contains interesting
> > bigrams such as
> >
> > in-band<>signalling
> > in-station<>modem
> >
> > But also new, uninteresting bigrams such as
> >
> > in Recommendation
> > defined in
> > shown in
> > described in
> > given in
> >
> > Is it possible to get just bigrams like
> >
> > in-band<>signalling
> > in-station<>modem
> >
> > And not the other new, uninteresting bigrams?
> >
> > Thanks for your help,
> >
> >
> > Mercè
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@> wrote:
> > >
> > > Hi Merce,
> > >
> > > Yes, indeed, you can do as you describe. This gets into some important
> > > details about regular expressions that I'm happy to have a chance to
> > > mention. In the default stoplist the stop words are delimited by \b, as in
> > >
> > > /\bin\b/
> > >
> > > This means match "in" as a stop word when surrounded by word boundaries. A
> > > word boundary occurs next to spaces as well as various punctuation marks,
> > > including the -.
> > >
> > > So, if you want to find bigrams like "in-line" but then exclude ones like
> > > "in the", then you need to adjust the stoplist so that the stop words are
> > > perhaps just surrounded by spaces. I say perhaps since there are various
> > > ways to do this, but the simplest one is shown below...
> > >
> > > ted@linux-zxku:~> more stop.txt
> > > @stop.mode=OR
> > > /\b[iI]n\s/
> > >
> > > ted@linux-zxku:~> more token.txt
> > > /\w+-\w+/
> > > /\w+/
> > >
> > > ted@linux-zxku:~> more test.txt
> > > i like in-line skating in late june.
> > >
> > > ted@linux-zxku:~> count.pl output.txt test.txt --token token.txt --stop stop.txt
> > >
> > > ted@linux-zxku:~> more output.txt
> > > 6
> > > in<>late<>1 1 1
> > > late<>june<>1 1 1
> > > skating<>in<>1 1 1
> > > in-line<>skating<>1 1 1
> > > i<>like<>1 1 1
> > > like<>in-line<>1 1 1
> > >
> > > I hope this helps.
> > >
> > > Enjoy,
> > > Ted
> > >
> > > On Fri, Apr 22, 2011 at 11:41 AM, mercevg <mercevg@> wrote:
> > >
> > > >
> > > >
> > > > Ted,
> > > >
> > > > > Thanks, I've added this regular expression to my tokens file and it works
> > > > > well.
> > > >
> > > > One more comment about that:
> > > >
> > > > > In my corpus I have some interesting bigrams such as
> > > > "in-band signalling"
> > > > "in-call rearrangement"
> > > > "in-slot signalling"
> > > >
> > > > > If I filter "in" as a stopword, I can't get these kinds of bigrams from my
> > > > > corpus. On the other hand, if "in" is not on my stopwords list, I retrieve
> > > > > these bigrams but I also get more uninteresting bigrams, such as
> > > >
> > > > "in Recommendation"
> > > > "in Figure"
> > > > "in order"
> > > >
> > > > > My question is: Can I filter and retrieve these two groups of bigrams at
> > > > > the same time?
> > > >
> > > > Thank you for your help,
> > > >
> > > > Mercè
> > > >
> > > >
> > > > --- In ngram@yahoogroups.com, Ted Pedersen <tpederse@> wrote:
> > > > >
> > > > > Greetings Merce,
> > > > >
> > > > > > This is fairly easy to handle via the --token option. You simply specify a
> > > > > > regular expression that says a token is a string followed by a - followed by
> > > > > > a string. You can customize a --token file many ways, but the following
> > > > > > example will handle hyphenated words. Please do let us know if additional
> > > > > > questions arise!
> > > > >
> > > > > linux@linux:~> count.pl test.out test.txt --token token.txt
> > > > >
> > > > > linux@linux:~> more test.out
> > > > > 13
> > > > > cell-phone<>It<>1 1 1
> > > > > the<>village-shop<>1 1 1
> > > > > s<>extra-nice<>1 1 1
> > > > > village-shop<>today<>1 1 1
> > > > > bought<>a<>1 1 1
> > > > > went<>to<>1 1 1
> > > > > a<>cell-phone<>1 1 1
> > > > > i<>went<>1 1 1
> > > > > today<>and<>1 1 1
> > > > > It<>s<>1 1 1
> > > > > and<>I<>1 1 1
> > > > > I<>bought<>1 1 1
> > > > > to<>the<>1 1 1
> > > > >
> > > > > linux@linux:~> cat test.txt
> > > > > i went to the village-shop today, and I bought a cell-phone. It's
> > > > > extra-nice.
> > > > >
> > > > > linux@linux:~> cat token.txt
> > > > > /\w+\-\w+/
> > > > > /\w+/
> > > > >
> > > > > Enjoy,
> > > > > Ted
> > > > >
> > > > > On Wed, Apr 20, 2011 at 2:20 PM, mercevg <mercevg@> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > > I would like to know if it's possible to get a list of ngrams with a hyphen
> > > > > > > inside, maybe during the tokenization process.
> > > > > > >
> > > > > > > For example, I want to get these bigrams:
> > > > > > - call-connected signal
> > > > > > - clear-back signal
> > > > > > - clear-forward signal
> > > > > >
> > > > > > Instead of two bigrams for each one:
> > > > > > - call<>connected<>179 2608 527
> > > > > > connected<>signal<>189 320 9176
> > > > > >
> > > > > > - clear<>back<>283 1115 733
> > > > > > back<>signal<>157 380 9176
> > > > > >
> > > > > > - clear<>forward<>632 1115 877
> > > > > > forward<>signal<>493 1547 9176
> > > > > >
> > > > > > Thanks a lot,
> > > > > >
> > > > > > Mercè
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ted Pedersen
> > > > > http://www.d.umn.edu/~tpederse
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Pedersen
> > > http://www.d.umn.edu/~tpederse
> > >
> >
> >  
> >
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

