Hi

Is there a way to remove just the leading and trailing stopwords from a token n-gram?

Currently I have the following combination which removes any n-gram that contains a stopword:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
  <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
</analyzer>

For example, if my document contains these n-grams: "Tower of London", "Tower in London", "and London", "London", with "of", "in" and "and" as stopwords, the shingle filter will produce:
tower _ london, tower _ london, _ london, london
(Note that the second "tower _ london" comes from a different phrase than the first, but that information is lost.)

and the pattern filter will then delete the first 3 n-grams.

What I really want, though, is to keep the stopwords inside an n-gram and strip only the leading and trailing ones, giving "tower of london", "tower in london", "london", "london".
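
To make the desired behaviour concrete, here is a minimal sketch in plain Java of the trimming I have in mind. It is only an illustration of the expected input and output, not a Lucene TokenFilter, and it uses no Lucene or Solr API. The class name EdgeStopwordTrimmer and the method trimEdges are made up for this example, and it assumes "and" is in the stopword list alongside "of" and "in".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only: not part of Lucene or Solr.
public class EdgeStopwordTrimmer {

    /**
     * Lowercases an n-gram and strips stopwords from both ends.
     * Stopwords in the interior of the n-gram are kept.
     */
    public static String trimEdges(String ngram, Set<String> stopwords) {
        String[] words = ngram.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int start = 0;
        int end = words.length;
        // Skip leading stopwords.
        while (start < end && stopwords.contains(words[start])) {
            start++;
        }
        // Skip trailing stopwords.
        while (end > start && stopwords.contains(words[end - 1])) {
            end--;
        }
        return String.join(" ", Arrays.copyOfRange(words, start, end));
    }

    public static void main(String[] args) {
        // Assumed stopword list for this example.
        Set<String> stopwords = new HashSet<>(Arrays.asList("of", "in", "and"));
        List<String> ngrams = Arrays.asList(
                "Tower of London", "Tower in London", "and London", "London");

        List<String> kept = new ArrayList<>();
        for (String ngram : ngrams) {
            String trimmed = trimEdges(ngram, stopwords);
            // Drop n-grams that consist only of stopwords.
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        System.out.println(kept);
        // prints: [tower of london, tower in london, london, london]
    }
}
```

In the analyzer chain this logic would have to run on the shingles themselves, that is, after the shingle filter and without the stop filter in front of it, since the stop filter has already discarded the original words by the time the shingles are built.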

Is this possible?

Many thanks!


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
