Hi Florian,

See comments below...

On Sun, Feb 22, 2009 at 6:19 PM, ftwaroch <f.a.twar...@cs.cf.ac.uk> wrote:
> Dear ngram group,
>
> Thanks for a great tool! I started playing with count.pl a couple of
> days ago and wondered if it was possible to do the opposite of a
> stopword list. My intention is to create an n-gram file that contains
> only n-grams with a certain item (I investigate placenames in a text.)
> I replaced all known placenames with a dummy value XTOPOX, and
> defined a stoplist file -
>
> @stop.mode=AND
> /[^XTOP]/

A very interesting question, and you have the right idea. However, I
think your regular expression might be doing something other than you
expect.

In Perl regular expressions,

[ABC]

represents a character class that will match any one character
(either A or B or C).

[^ABC]

will match any single character except A or B or C (including
digits, punctuation, and lowercase letters).

so,

[^XTOP]

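has the effect of matching any single character except X or T or O or
P. So a token matches as long as it contains at least one character
other than those four. That covers most ordinary words, but tokens
made up only of X, T, O, and P (such as TO, XX, X, and T in your
output) do not match either, so they are treated just like XTOPOX.
That will do some of what you want, but not all of it.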
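
To see the effect, here is a small sketch that applies /[^XTOP]/ to a
few tokens copied from your example output. It is plain Perl and does
not use count.pl at all; the token list and variable names are only
for illustration.

```perl
use strict;
use warnings;

# Tokens copied from the example output in this thread.
my @tokens = qw(XTOPOX TO XX X T to Tudur CHAPTER OF);

# A token matches /[^XTOP]/ if it contains at least one character
# that is not X, T, O, or P.
my @matched   = grep { /[^XTOP]/ } @tokens;
my @unmatched = grep { !/[^XTOP]/ } @tokens;

print "matched:   @matched\n";    # matched:   to Tudur CHAPTER OF
print "unmatched: @unmatched\n";  # unmatched: XTOPOX TO XX X T
```

The unmatched group is the interesting one: XTOPOX lands there as
intended, but so do TO, XX, X, and T.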

In general I don't think Perl regular expressions on their own offer a
clean way to define "match everything except the following string"
(i.e., negate the regex). The !~ operator does exactly this, and you
can use it within Perl expressions, as in:

if ($x !~ /XTOPOX/)

but of course the stoplist doesn't take expressions as input...just
regular expressions.
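
For what it is worth, here is a rough sketch of the two kinds of
negation side by side. The first uses !~ in ordinary Perl code. The
second puts the negation inside the pattern itself with a negative
lookahead, (?!...), which may be one way around the limitation. The
keep_bigram function is a hypothetical stand-in for an AND-mode
stoplist (drop a bigram only when every token matches the stop regex);
it is not count.pl's actual code, and whether count.pl's stoplist
handles a lookahead the same way is an assumption that would need
checking against a small test file.

```perl
use strict;
use warnings;

# Bigrams copied from the example output in this thread.
my @bigrams = (
    [ 'to',      'XTOPOX' ],
    [ 'OF',      'TO'     ],
    [ 'XX',      'THE'    ],
    [ 'XTOPOX',  'on'     ],
    [ 'CHAPTER', 'X'      ],
);

# Negation in code: !~ is true when the token is not exactly XTOPOX.
my @not_places = grep { $_ !~ /^XTOPOX\z/ } qw(XTOPOX TO on);

# Negation inside the regex: the negative lookahead makes this pattern
# match any token that is not exactly XTOPOX.
my $stop_re = qr/^(?!XTOPOX\z)/;

# Hypothetical AND-mode filter: a bigram is dropped only when every
# one of its tokens matches the stop regex.
sub keep_bigram {
    my ($re, @words) = @_;
    my $stopped = grep { $_ =~ $re } @words;   # number of stop tokens
    return $stopped < @words;
}

my @kept = grep { keep_bigram($stop_re, @$_) } @bigrams;

print "not placenames: @not_places\n";
print join('<>', @$_), "\n" for @kept;
```

With this sample data only to<>XTOPOX and XTOPOX<>on survive the
filter, which is the behaviour you are after.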

Let me think on this a bit more...

Cordially,
Ted

>
> This is not very clean approach as all patterns that are not XTOP are
> returned and I get noise back as well, see example:
>
> example_out.txt
>
> 16
> to<>XTOPOX<>2 2 4
> XTOPOX<>.<>2 4 3
> Tudur<>XTOPOX<>1 1 4
> XTOPOX<>on<>1 4 1
> XTOPOX<>,<>1 4 1
> OF<>TO<>1 1 1
> XX<>THE<>1 1 1
> CHAPTER<>XX<>1 2 1
> ,<>XTOPOX<>1 2 4
> TO<>DAY<>1 1 1
> X<>LLYWELYN<>1 1 1
> T<>.<>1 1 3
> ,<>T<>1 2 1
> CHAPTER<>X<>1 2 1
>
> Is there approach to that? If you have any pointers for me I would be
> very happy.
> many thanks,
>
> Florian
>

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
