[ngram] Re: Possible count.pl bug

Ted Pedersen Fri, 17 Oct 2008 11:47:13 -0700

Hi Otis,

I looked at your stoplist, and I think the problem is in how the
regular expressions are constructed.

Here are just a couple of the entries. As written, the first one
will match any string that contains an "a", and the second one any
string that contains "able":

/a/
/able/

You'd want to insert a word boundary (\b) at the start and end of
each of these to avoid that, as in:

/\ba\b/
/\bable\b/

Then I think things will work as you expect!
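
To illustrate the difference, here is a minimal sketch in Python rather
than Perl. It is not how count.pl itself applies the stoplist; the token
list and the surviving() helper are made up for the example, and only the
two patterns above come from your stop.txt. For these simple patterns,
Python's re module treats \b the same way Perl does.

```python
import re

# Hypothetical sample tokens; only "a" and "able" are meant to be stop words.
tokens = ["a", "Alaska", "able", "table", "Barack", "I", "."]

# The two stoplist entries, first as originally written, then with \b added.
unanchored = [re.compile(r"a"), re.compile(r"able")]
anchored = [re.compile(r"\ba\b"), re.compile(r"\bable\b")]


def surviving(tokens, patterns):
    """Return the tokens that none of the patterns match anywhere."""
    return [t for t in tokens if not any(p.search(t) for p in patterns)]


# Unanchored: every token containing an "a" is dropped.
print(surviving(tokens, unanchored))  # ['I', '.']

# Anchored: only the whole words "a" and "able" are dropped.
print(surviving(tokens, anchored))  # ['Alaska', 'table', 'Barack', 'I', '.']
```

With the unanchored patterns only "I" and "." survive, which looks a lot
like the count.txt you got when running with -stop.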

I hope this helps, let us know as other questions arise.

Cordially,
Ted

On Fri, Oct 17, 2008 at 12:06 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Thanks Ted.  That takes care of the stop words, but I think something else 
> funky happens now:
>
> $ cat simple.txt
> This is Barack Obama who is building a house for Barack Obama family.
> I am John McCain and I'm the maverick like you've never seen before. I love Alaska.
> When I was a young lad I spent summers building houses in Alaska.
>
> $ count.pl -stop stop.txt -ngram 1 count.txt simple.txt
>
> $ cat count.txt
> 9
> I<>5
> .<>4
>
>
> Let's try again without -stop:
>
> $ rm count.txt
>
> $ count.pl -ngram 1 count.txt simple.txt
>
> $ head count.txt
> 48
> I<>5
> .<>4
> is<>2
> Obama<>2
> a<>2
> building<>2
> Alaska<>2
> Barack<>2
> houses<>1
>
>
> For some reason using -stop messed things up.  Nothing funky in my stop.txt 
> (attached) I believe.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> ----- Original Message ----
>> From: Ted Pedersen <[EMAIL PROTECTED]>
>> To: Otis Gospodnetic <[EMAIL PROTECTED]>
>> Cc: ngram@yahoogroups.com
>> Sent: Thursday, October 16, 2008 7:26:27 PM
>> Subject: Re: Possible count.pl bug
>>
>> Hi Otis,
>>
>> It's great to hear you are using NSP! The stop lists have two
>> different "modes" in which they operate - OR and AND mode. By default
>> they are used in AND mode, where a bigram must consist of two stop
>> words to be removed (that is both words must be stop words). It sounds
>> like you would like to use the OR mode, where a bigram would be
>> eliminated if either word is a stop word. You can do that by
>> specifying OR mode on the first line of your stop.txt file.
>>
>> @stop.mode=OR
>> /said/
>> /the/
>>
>> This should result in a list more to your liking! Notice that if you
>> use --ngram 1 then the OR or AND doesn't matter, since any unigram
>> that is a stop word will be removed. For ngrams greater than 2, AND
>> and OR stop modes operate as expected - AND requiring that all n words
>> be stop words to be removed, while OR would eliminate them if any
>> single word is a stop word.
>>
>> I hope this all makes sense. More on these issues here :
>>
>> http://search.cpan.org/dist/Text-NSP/doc/README.pod#5.6._%22Stopping%22_the_Ngrams:
>>
>> Please let us know if there are any additional questions or suggestions!
>>
>> Cordially,
>> Ted
>>
>> On Thu, Oct 16, 2008 at 2:51 PM, Otis Gospodnetic wrote:
>> > Hello Ted,
>> >
>> > I was playing with Text::NSP, count.pl in particular, and I might be
>> > seeing a small bug.
>> > I ran it against some news articles, like this:
>> >
>> > $ count.pl -stop stop.txt -frequency 5 -window 4 -hist hist.txt count.txt a1.txt
>> >
>> > This produced count.txt with:
>> >
>> > 636
>> > .<>Obama<>11 129 21
>> > ,<>said<>9 126 13
>> > ,<>Obama<>7 126 21
>> > the<>.<>7 15 132
>> > ,<>McCain<>6 126 11
>> > ,<>.<>6 126 132
>> > said<>.<>6 9 132
>> > .<>,<>5 129 124
>> > Obama<>.<>5 13 132
>> > the<>,<>5 15 124
>> > .<>The<>5 129 8
>> > in<>.<>5 7 132
>> >
>> > Note all those stop words in there.  I'd like to get rid of them and I
>> > think that's what that -stop stop.txt should do, no?
>> >
>> > $ egrep '/said/|/the/' stop.txt
>> > /said/
>> > /the/
>> >
>> > Is this a bug or am I doing something wrong?
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >
>> >
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
