Hi Otis,

It's great to hear you are using NSP! The stop lists operate in one
of two "modes": OR and AND. By default they are used in AND mode,
where a bigram is removed only if both of its words are stop words.
It sounds like you would like to use the OR mode, where a bigram is
eliminated if either word is a stop word. You can do that by
specifying OR mode on the first line of your stop.txt file:

@stop.mode=OR
/said/
/the/

This should result in a list more to your liking! Note that if you
use --ngram 1, the choice of OR or AND doesn't matter, since any
unigram that is a stop word will be removed. For n-grams longer than
2, the AND and OR stop modes operate as expected: AND requires that
all n words be stop words for the n-gram to be removed, while OR
removes the n-gram if any single word is a stop word.
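
To make the difference between the two modes concrete, here is a rough
sketch of the removal rule in Python. This is not NSP's actual code
(NSP is written in Perl); the function names are made up for
illustration, and I am assuming each /.../ entry is treated as a
regular expression that may match anywhere inside a token.

```python
import re

def load_stop_patterns(lines):
    """Parse stop-list lines into (mode, compiled patterns).

    Assumption: an optional '@stop.mode=OR' or '@stop.mode=AND' line sets
    the mode, and the mode is AND when no such line is present.
    """
    mode = "AND"
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("@stop.mode="):
            mode = line.split("=", 1)[1].strip().upper()
        elif len(line) > 2 and line.startswith("/") and line.endswith("/"):
            # Strip the surrounding slashes and compile the regex.
            patterns.append(re.compile(line[1:-1]))
    return mode, patterns

def is_stop_word(token, patterns):
    """A token counts as a stop word if any pattern matches inside it."""
    return any(p.search(token) for p in patterns)

def is_removed(ngram, mode, patterns):
    """AND: every word must be a stop word. OR: one stop word is enough."""
    flags = [is_stop_word(tok, patterns) for tok in ngram]
    return any(flags) if mode == "OR" else all(flags)

and_mode, pats = load_stop_patterns(["/said/", "/the/"])
or_mode, _ = load_stop_patterns(["@stop.mode=OR", "/said/", "/the/"])

print(is_removed((",", "said"), and_mode, pats))    # False: "," is not a stop word
print(is_removed((",", "said"), or_mode, pats))     # True: "said" is a stop word
print(is_removed(("said", "the"), and_mode, pats))  # True: both are stop words
```

With a single-word n-gram the two branches give the same answer, which
is why the mode has no effect when --ngram 1 is used.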

I hope this all makes sense. There is more on these issues here:

http://search.cpan.org/dist/Text-NSP/doc/README.pod#5.6._%22Stopping%22_the_Ngrams:

Please let us know if there are any additional questions or suggestions!

Cordially,
Ted

On Thu, Oct 16, 2008 at 2:51 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hello Ted,
>
> I was playing with Text::NSP, count.pl in particular, and I might be seeing a small bug.
> I ran it against some news articles, like this:
>
> $ count.pl -stop stop.txt -frequency 5 -window 4 -hist hist.txt count.txt a1.txt
>
> This produced count.txt with:
>
> 636
> .<>Obama<>11 129 21
> ,<>said<>9 126 13
> ,<>Obama<>7 126 21
> the<>.<>7 15 132
> ,<>McCain<>6 126 11
> ,<>.<>6 126 132
> said<>.<>6 9 132
> .<>,<>5 129 124
> Obama<>.<>5 13 132
> the<>,<>5 15 124
> .<>The<>5 129 8
> in<>.<>5 7 132
>
> Note all those stop words in there.  I'd like to get rid of them and I think that's what that -stop stop.txt should do, no?
>
> $ egrep '/said/|/the/' stop.txt
> /said/
> /the/
>
> Is this a bug or am I doing something wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
