Hi Otis,

It's great to hear you are using NSP!

The stop lists operate in one of two "modes": OR and AND. By default they are used in AND mode, where a bigram is removed only if both of its words are stop words. It sounds like you would like to use OR mode, where a bigram is eliminated if either word is a stop word. You can do that by specifying OR mode on the first line of your stop.txt file:

@stop.mode=OR
/said/
/the/

This should result in a list more to your liking! Note that if you use --ngram 1, the choice of OR or AND doesn't matter, since any unigram that is a stop word will be removed. For ngrams longer than two words, the AND and OR stop modes operate as expected: AND requires that all n words be stop words for the ngram to be removed, while OR eliminates the ngram if any single word is a stop word.

I hope this all makes sense. More on these issues here:

http://search.cpan.org/dist/Text-NSP/doc/README.pod#5.6._%22Stopping%22_the_Ngrams:

Please let us know if there are any additional questions or suggestions!

Cordially,
Ted

On Thu, Oct 16, 2008 at 2:51 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hello Ted,
>
> I was playing with Text::NSP, count.pl in particular, and I might be seeing a
> small bug.
> I ran it against some news articles, like this:
>
> $ count.pl -stop stop.txt -frequency 5 -window 4 -hist hist.txt count.txt a1.txt
>
> This produced count.txt with:
>
> 636
> .<>Obama<>11 129 21
> ,<>said<>9 126 13
> ,<>Obama<>7 126 21
> the<>.<>7 15 132
> ,<>McCain<>6 126 11
> ,<>.<>6 126 132
> said<>.<>6 9 132
> .<>,<>5 129 124
> Obama<>.<>5 13 132
> the<>,<>5 15 124
> .<>The<>5 129 8
> in<>.<>5 7 132
>
> Note all those stop words in there. I'd like to get rid of them and I think
> that's what that -stop stop.txt should do, no?
>
> $ egrep '/said/|/the/' stop.txt
> /said/
> /the/
>
> Is this a bug or am I doing something wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
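
As an illustration of the two stop modes described in the reply above, here is a minimal Python sketch. It is not NSP's implementation: NSP is written in Perl and its stop list entries are Perl regular expressions, whereas this sketch assumes plain string matching for simplicity. The function name `is_stopped` and the sample bigrams are hypothetical, chosen only to mirror the /said/ and /the/ entries from the thread.

```python
def is_stopped(ngram, stop_words, mode="AND"):
    """Return True if the ngram would be removed under the given stop mode.

    Assumption: stop words are compared as exact strings, not as regexes.
    """
    hits = [token in stop_words for token in ngram]
    if mode == "AND":
        # AND mode: remove only if every token is a stop word.
        return all(hits)
    if mode == "OR":
        # OR mode: remove if at least one token is a stop word.
        return any(hits)
    raise ValueError(f"unknown stop mode: {mode}")


stop_words = {"said", "the"}
bigrams = [(",", "said"), ("the", "."), ("said", "the"), (".", "Obama")]

# Bigrams that survive filtering in each mode.
kept_and = [b for b in bigrams if not is_stopped(b, stop_words, "AND")]
kept_or = [b for b in bigrams if not is_stopped(b, stop_words, "OR")]

print("AND keeps:", kept_and)
print("OR keeps: ", kept_or)
```

Under AND mode only the bigram made up entirely of stop words is dropped, which is why entries such as `,<>said` and `the<>.` remained in the original output. Under OR mode any bigram containing a stop word is dropped. For a unigram the two modes give the same answer, since there is only one token to test.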