Hi Jason,

On 7/27/2009 at 3:15 PM, Jason Rutherglen wrote:
> I'd like to enable ShingleFilter to only create shingles for a set of
> (stop) words (rather than for all N tokens).

For purposes of discussion, here's some example input (first sentence
from <http://en.wikipedia.org/wiki/Manufacturing>):

    Manufacturing is the use of machines, tools and labor to make
    things for use or sale.

For n=2 and stoplist = { is, the, of, and, to, for, or }, and assuming
WhitespaceAnalyzer, I think what you want is for ShingleFilter (with no
unigrams output) to *exclude* only the following shingles from its
output; every other bigram contains at least one stopword, so it would
be output:

    /machines, tools/
    /make things/

Is this what you want?

It might make sense, rather than modifying ShingleFilter, to create a
new TokenFilter that can exclude terms you don't like. Solr has
KeepWordFilter, which is close to what you want (the inverse of
StopFilter), with the exception that you want to keep shingles that
*contain* words from a list you supply.

Perhaps a new TokenFilter subclass that can take in a regular
expression would work? (Maybe called KeepRegexFilter.) Stopword lists
are generally small enough to make building a regex to match them
fairly simple, e.g. for the above list:

    (?:^|\s)(?:is|the|of|and|to|for|or)(?:\s|$)

Alternatively/additionally, maybe a Keep{Term,Phrase,Keyword}Filter
that takes in a list of words, then builds a regex like above?

Having this functionality separate from ShingleFilter would be nice, I
think, because it would be useful in other contexts.

Steve
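
For illustration only, here is a minimal sketch of the regex idea in
plain Python rather than Lucene code. It builds the pattern from the
stoplist, forms bigrams from the whitespace-split example sentence, and
keeps only the bigrams the pattern matches. The function names
(build_keep_pattern, shingles, keep_matching) are hypothetical, and the
matching is case-sensitive; an actual KeepRegexFilter would be a
TokenFilter subclass working on the token stream.

```python
import re


def build_keep_pattern(words):
    """Build a regex that matches any listed word as a whole,
    whitespace-delimited token inside a term (hypothetical helper)."""
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"(?:^|\s)(?:" + alternation + r")(?:\s|$)")


def shingles(tokens, n=2):
    """Return every n-token shingle, joined by single spaces (no unigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keep_matching(terms, pattern):
    """Keep only the terms in which the pattern matches somewhere."""
    return [t for t in terms if pattern.search(t)]


STOPLIST = ["is", "the", "of", "and", "to", "for", "or"]
SENTENCE = ("Manufacturing is the use of machines, tools and labor "
            "to make things for use or sale.")

# Whitespace tokenization, so punctuation stays attached ("machines,").
tokens = SENTENCE.split()
pattern = build_keep_pattern(STOPLIST)

bigrams = shingles(tokens, 2)
kept = keep_matching(bigrams, pattern)
# Bigrams are unique for this sentence, so a membership test is enough.
dropped = [b for b in bigrams if b not in kept]

print("kept:", kept)
print("dropped:", dropped)
```

On the example sentence, the only bigrams dropped are "machines, tools"
and "make things", matching the two shingles listed above. Because the
pattern anchors on whitespace or the ends of the term, a word like
"tools" does not count as containing the stopword "to".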