token n-gram: leading and trailing stopwords removal only

Ziqi Zhang Fri, 25 Sep 2015 02:16:30 -0700

Hi

Is there a way to remove just the leading and trailing stopwords from a token n-gram?

Currently I have the following combination, which removes any n-gram that contains a stopword:


<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" " />
  <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement="" />
</analyzer>

For example, if my document contains these n-grams: "Tower of London", "Tower in London", "and London", "London", with "of, in" as stopwords, the shingle filter will produce:

tower _ london, tower _ london, _ london, london

(Note, however, that the second "tower _ london" is different from the first, but this information is lost.)


The pattern filter will then delete the first 3 n-grams.

What I really want to do, though, is to keep "tower of london", "tower in london", "london", "london".
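
To make the behaviour I am after concrete, here is a minimal sketch in plain Java. It is not a Lucene or Solr filter, only an illustration of the expected input and output. The class name StopwordEdgeTrimmer and the method trimEdges are hypothetical, and I am assuming that "and" is also in the stopword list (otherwise "and London" would not lose its first token).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
class StopwordEdgeTrimmer {

    // Lower-cases an n-gram, then drops stopwords from its start and end only.
    // Stopwords that sit between two kept tokens are left in place.
    static String trimEdges(String ngram, Set<String> stopwords) {
        String normalized = ngram.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return "";
        }
        String[] tokens = normalized.split("\\s+");
        int start = 0;
        int end = tokens.length;
        // Skip leading stopwords.
        while (start < end && stopwords.contains(tokens[start])) {
            start++;
        }
        // Skip trailing stopwords.
        while (end > start && stopwords.contains(tokens[end - 1])) {
            end--;
        }
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Assumed stopword list: "of" and "in" from the example, plus "and".
        Set<String> stopwords = new HashSet<>(Arrays.asList("of", "in", "and"));
        List<String> ngrams = Arrays.asList(
                "Tower of London", "Tower in London", "and London", "London");

        List<String> kept = new ArrayList<>();
        for (String ngram : ngrams) {
            String trimmed = trimEdges(ngram, stopwords);
            // An n-gram made up entirely of stopwords would be dropped.
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        System.out.println(kept);
    }
}
```

Run on the four example n-grams, this keeps "tower of london", "tower in london", "london" and "london", which is the output I would like the analysis chain to produce.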


Is this possible?

Many thanks!


