It seems the only limiting factor is that the regular expressions in
the stop-word list are matched against the individual words rather than
the whole ngram.

Perhaps you could run count.pl without a stop-word list, then use
e.g. sed to filter the results.  That would let you apply regular
expressions to the whole ngram, including the position of each word.
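
A minimal sketch of that post-filtering step, written in Python rather
than sed. It assumes count.pl writes the total ngram count on the first
line and then one ngram per line with tokens separated by "<>" and the
counts in the last field (e.g. "data<>of<>variable<>1 1 1 ..."). The
stop-word pattern and the function names are made up for illustration,
so swap in your own list.

```python
import re

# Hypothetical stop-word pattern (an assumption, not NSP's stoplist format).
STOP = re.compile(r"^(?:the|of|for|a|an|by|is)$", re.IGNORECASE)

def keep_ngram(line, stop=STOP):
    """Return True unless the first or last token of the ngram is a stop word."""
    fields = line.rstrip("\n").split("<>")
    tokens = fields[:-1]          # the last field holds the counts
    if not tokens:
        return True               # no "<>" present, e.g. the total-count line
    return not (stop.match(tokens[0]) or stop.match(tokens[-1]))

def filter_counts(lines, stop=STOP):
    """Keep only the lines whose ngram passes the first/last-position check."""
    return [ln for ln in lines if keep_ngram(ln, stop)]
```

To use it as a filter, something like
sys.stdout.writelines(filter_counts(sys.stdin)) should do. Stop words in
the middle position are left alone, so "data<>of<>variable" survives
while "of<>variable<>length" is dropped. Note that the total on the
first line is passed through unchanged and is not recomputed.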



--- In ngram@yahoogroups.com, Ted Pedersen <tpede...@...> wrote:
>
> This is a great question, and the short answer is NSP does not support
> this kind of stopword filtering (although it would clearly be a good
> thing to provide).
> 
> As a quick review for others...
> 
> The stopword mechanism in NSP allows you to either filter Ngrams that
> are completely made up of stopwords ('and' mode), or to filter Ngrams
> that contain one or more stop words ('or' mode) without regard to
> position. Which mode you get depends on how you set up your stoplist
> file...your stoplist file should start with mode, and then be followed
> by regular expressions representing the tokens you'd like to have
> considered as stop words...
> 
> @stop.mode=AND
> /\bthe\b/
> /\bfor\b/
> 
> or
> 
> @stop.mode=OR
> /\bthe\b/
> /\bfor\b/
> 
> The OR list would filter out "the united states" while the AND list
> would let that be used (since not all words are in the stoplist). If
> you don't specify the stop.mode you get AND by default...
> 
> I'll note this as an excellent suggestion, and take a look at what
> would be involved in supporting it. For now though I can't think of a
> good way to do this with NSP.
> 
> Cordially,
> Ted
> 
> On Thu, Jan 29, 2009 at 11:11 AM, mercevg <merc...@...> wrote:
> > Dear all,
> >
> > I would like to know if it's possible with NSP not to filter stopwords
> > inside of tri-grams. In my results list I just want to filter
> > stopwords placed in the first and last position of a tri-gram.
> >
> > As an example, in a sentence like this:
> > "Data of variable length (the operand) is preceded by an opcode."
> >
> > I would like to get as a result list "data of variable" and not "Data
> > variable length".
> >
> > Best wishes,
> > Mercè
> >
> > 
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

