Hello,

I would like to ask whether there is interest in adding a TruncateTokenFilter
to Lucene.

I use this filter as a stemmer for Turkish. It appears in many academic
studies (clustering, classification, retrieval), where it is called the Fixed
Prefix Stemmer, the Simple Truncation Method, or F5 for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish
in [1]. This is the same work from which some of the entries in
stopwords_tr.txt were taken.

[1] "Information Retrieval on Turkish Texts"
http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf
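
To make the idea concrete, here is a minimal sketch of fixed-prefix truncation
in plain Java, without any Lucene dependency. The class and method names
(FixedPrefixStemSketch, truncate, stem) and the isKeyword flag are my own
placeholders for illustration; they are not the proposed filter or any
existing Lucene API.

```java
public class FixedPrefixStemSketch {

    // Keep at most prefixLength leading chars of the term.
    // Simplification: this counts UTF-16 code units, not code points.
    public static String truncate(String term, int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        return term.length() > prefixLength ? term.substring(0, prefixLength) : term;
    }

    // Hypothetical keyword handling: a term flagged as a keyword is left untouched.
    public static String stem(String term, boolean isKeyword, int prefixLength) {
        return isKeyword ? term : truncate(term, prefixLength);
    }

    public static void main(String[] args) {
        // F5: prefix length 5, terms already lowercased and ASCII-folded
        for (String term : new String[] {"kitaplar", "kitaplik", "ev"}) {
            System.out.println(term + " -> " + stem(term, false, 5));
        }
    }
}
```

With length 5, "kitaplar" and "kitaplik" both reduce to "kitap", while "ev" is
shorter than the prefix and stays as it is.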

ElasticSearch has this filter, but it does not respect the keyword attribute.

The main advantage of F5 stemming is that it is not affected by the meaning
loss caused by ASCII folding. It works well together with ASCII folding.
[2] "Effects of diacritics on Turkish information retrieval" 
http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf

Here is the full field type I use for customers:

 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
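
The last three filters in this chain only work as intended if the truncate
filter respects the keyword attribute. The rough sketch below simulates those
three steps for a single term in plain Java. F5ChainSketch, termsAtPosition
and respectKeyword are hypothetical names for illustration only, and the set
stands in for duplicate removal at one position; none of this is Lucene code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class F5ChainSketch {

    // Terms that would end up at one position for a single input term.
    public static List<String> termsAtPosition(String term, int prefixLength,
                                               boolean respectKeyword) {
        // Duplicate-removal step: same text at the same position is kept once.
        Set<String> out = new LinkedHashSet<>();
        // Keyword-repeat step: first copy is flagged keyword, second is not.
        boolean[] keywordFlags = {true, false};
        for (boolean isKeyword : keywordFlags) {
            boolean skip = respectKeyword && isKeyword;
            // Truncate step: shorten unless skipped or already short enough.
            String t = (skip || term.length() <= prefixLength)
                    ? term
                    : term.substring(0, prefixLength);
            out.add(t);
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(termsAtPosition("kitaplar", 5, true));
        System.out.println(termsAtPosition("kitaplar", 5, false));
        System.out.println(termsAtPosition("ev", 5, true));
    }
}
```

When the keyword flag is respected, "kitaplar" is indexed both as "kitaplar"
and as "kitap". When it is ignored, both copies are truncated and only "kitap"
survives, so the original form is lost.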

I would like to get the community's opinions on:

1) Is there interest in this? Should I create a JIRA issue and attach what I have?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter

Thanks,
Ahmet
