Hi Jason,

On 7/27/2009 at 3:15 PM, Jason Rutherglen wrote:
> I'd like to enable ShingleFilter to only create shingles for a set of
> (stop) words (rather than for all N tokens).

For purposes of discussion, here's some example input (first sentence from 
<http://en.wikipedia.org/wiki/Manufacturing>):

        Manufacturing is the use of machines, tools and labor
        to make things for use or sale.

For n=2 and stoplist = { is, the, of, and, to, for, or }, and assuming 
WhitespaceAnalyzer and no unigram output, I think what you want is for 
ShingleFilter to output every bigram that contains at least one stopword, 
and to *exclude* only the following shingles, which contain none:

        /machines, tools/
        /make things/

Is this what you want?
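
As a sanity check on the example, here is a minimal plain-Java sketch (no 
Lucene dependency) that splits the sentence on whitespace, builds the bigrams, 
and lists those containing no stopword.  The class and method names are made 
up for illustration, and the string splitting only approximates what 
WhitespaceAnalyzer plus ShingleFilter would produce:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
class StopwordBigrams {

    // Split on whitespace only, so punctuation stays attached to tokens
    // ("machines," and "sale."), then join each adjacent pair with a space.
    static List<String> bigrams(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + " " + tokens[i + 1]);
        }
        return out;
    }

    // Return the shingles in which no token is a stopword.
    static List<String> withoutStopword(List<String> shingles, Set<String> stopwords) {
        List<String> out = new ArrayList<String>();
        for (String shingle : shingles) {
            boolean hasStopword = false;
            for (String token : shingle.split(" ")) {
                if (stopwords.contains(token)) {
                    hasStopword = true;
                }
            }
            if (!hasStopword) {
                out.add(shingle);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Manufacturing is the use of machines, tools and labor "
                    + "to make things for use or sale.";
        Set<String> stopwords = new HashSet<String>(
            Arrays.asList("is", "the", "of", "and", "to", "for", "or"));
        for (String shingle : withoutStopword(bigrams(text), stopwords)) {
            System.out.println("/" + shingle + "/");
        }
    }
}
```

With the sentence and stoplist above, the only bigrams this reports as 
stopword-free are the two listed.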

It might make sense, rather than modifying ShingleFilter, to create a new 
TokenFilter that runs after ShingleFilter and drops the shingles you don't 
want.

Solr has KeepWordFilter (the inverse of StopFilter), which is close to what 
you want, except that it keeps only tokens that appear in its word list, 
whereas you want to keep shingles that *contain* words from a list you supply.

Perhaps a new TokenFilter subclass that can take in a regular expression would 
work?  (Maybe called KeepRegexFilter.)  Stopword lists are generally small 
enough to make building a regex to match them fairly simple, e.g. for the above 
list:

        (?:^|\s)(?:is|the|of|and|to|for|or)(?:\s|$)

Alternatively/additionally, maybe a Keep{Term,Phrase,Keyword}Filter that takes 
in a list of words, then builds a regex like above?
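
To make the idea concrete, here is a rough sketch in plain Java of building 
that regex from a word list and using it to decide which shingles to keep.  
It is not a real TokenFilter: the class and method names are hypothetical, it 
works on plain strings rather than a TokenStream, and it assumes a non-empty 
word list and shingles whose tokens are separated by a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the matching logic only, not a Lucene TokenFilter.
class KeepRegexSketch {

    // Build (?:^|\s)(?:w1|w2|...)(?:\s|$) from the word list.
    // Pattern.quote guards against regex metacharacters in the words.
    static Pattern stopwordPattern(List<String> words) {
        StringBuilder alternation = new StringBuilder();
        for (String word : words) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word));
        }
        return Pattern.compile("(?:^|\\s)(?:" + alternation + ")(?:\\s|$)");
    }

    // Keep a shingle if the pattern is found anywhere in it.  Note find(),
    // not matches(): the pattern describes one word inside the shingle.
    static List<String> keep(List<String> shingles, Pattern pattern) {
        List<String> out = new ArrayList<String>();
        for (String shingle : shingles) {
            if (pattern.matcher(shingle).find()) {
                out.add(shingle);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern pattern = stopwordPattern(
            Arrays.asList("is", "the", "of", "and", "to", "for", "or"));
        List<String> shingles = Arrays.asList(
            "machines, tools", "tools and", "make things", "use of");
        for (String shingle : keep(shingles, pattern)) {
            System.out.println(shingle);
        }
    }
}
```

Because the stopword must be bounded by whitespace or by the start or end of 
the shingle, a token such as "format" does not count as a hit for "for" or 
"or".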

Having this functionality separate from ShingleFilter would be nice, I think, 
because it would be useful in other contexts.

Steve

