Ahmet Arslan created LUCENE-5558:
------------------------------------

             Summary: Add TruncateTokenFilter
                 Key: LUCENE-5558
                 URL: https://issues.apache.org/jira/browse/LUCENE-5558
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 4.7
            Reporter: Ahmet Arslan
            Priority: Minor
             Fix For: 4.8


I use this filter as a stemmer for Turkish. It appears in many academic 
studies (classification, retrieval), where it is called the Fixed Prefix 
Stemmer, the Simple Truncation Method, or F5 for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish 
in [Information Retrieval on Turkish 
Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf]. This is 
the same work from which most of stopwords_tr.txt was taken. 

Elasticsearch has a 
[truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html]
 filter, but it does not respect the keyword attribute. It also has a use case 
similar to TruncateFieldUpdateProcessorFactory.

The main advantage of F5 stemming is that it is not affected by the meaning 
loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works 
well with ASCII folding; see [Effects of diacritics on Turkish information 
retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].
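
To make the diacritics-insensitivity concrete, here is a minimal plain-Java sketch. It is not the proposed Lucene filter: the class F5Sketch, the method names, and the six-letter folding table are assumptions made for illustration, and the table is only a tiny stand-in for what ASCIIFoldingFilter does. It folds a lowercased term and then keeps at most the first 5 characters.

```java
// Illustrative sketch only: plain Java, not the Lucene TokenFilter.
// F5Sketch, foldTurkish and truncate are hypothetical names.
final class F5Sketch {

    // Minimal stand-in for ASCII folding: maps the six Turkish-specific
    // lowercase letters to their ASCII look-alikes, leaves the rest as-is.
    static String foldTurkish(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e7': sb.append('c'); break; // c with cedilla
                case '\u011f': sb.append('g'); break; // g with breve
                case '\u0131': sb.append('i'); break; // dotless i
                case '\u00f6': sb.append('o'); break; // o with diaeresis
                case '\u015f': sb.append('s'); break; // s with cedilla
                case '\u00fc': sb.append('u'); break; // u with diaeresis
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Fixed-prefix stemming: keep at most prefixLength leading characters.
    static String truncate(String term, int prefixLength) {
        return term.length() > prefixLength ? term.substring(0, prefixLength) : term;
    }

    // Fold first, then truncate to 5 characters (the F5 setting).
    static String f5(String term) {
        return truncate(foldTurkish(term), 5);
    }
}
```

With this sketch, a term typed with diacritics and the same term typed in plain ASCII reduce to the same 5-character prefix, and terms of 5 characters or fewer pass through unchanged.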

Here is the full field type I use for "diacritics-insensitive search" in 
Turkish:
{code:xml}
 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <!-- emit each token twice: once marked as keyword, once not -->
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <!-- proposed filter: truncates only the non-keyword copy -->
     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
{code}
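
The last three filters in this chain only work together if the truncate step skips keyword tokens. The following plain-Java simulation sketches that interaction. It does not use the Lucene API: ChainSketch, Tok, and all method names are hypothetical, and each method only mimics the documented behavior of the corresponding filter (repeat as keyword plus non-keyword copy, truncate non-keywords, drop same-text tokens at the same position).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation only; none of these types are Lucene classes.
final class ChainSketch {

    // Hypothetical stand-in for a token: text, keyword flag, position increment.
    static final class Tok {
        final String text;
        final boolean keyword;
        final int posInc;

        Tok(String text, boolean keyword, int posInc) {
            this.text = text;
            this.keyword = keyword;
            this.posInc = posInc;
        }
    }

    // Mimics KeywordRepeatFilter: a keyword copy, then a non-keyword copy
    // stacked at the same position (position increment 0).
    static List<Tok> keywordRepeat(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            out.add(new Tok(t.text, true, t.posInc));
            out.add(new Tok(t.text, false, 0));
        }
        return out;
    }

    // Mimics the proposed truncate filter, assuming it respects the keyword flag.
    static List<Tok> truncate(List<Tok> in, int prefixLength) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!t.keyword && t.text.length() > prefixLength) {
                out.add(new Tok(t.text.substring(0, prefixLength), false, t.posInc));
            } else {
                out.add(t); // keyword or already short enough: unchanged
            }
        }
        return out;
    }

    // Mimics RemoveDuplicatesTokenFilter: drop a token whose text was
    // already seen at the current position.
    static List<Tok> removeDuplicates(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        List<String> seenAtPosition = new ArrayList<>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                seenAtPosition.clear(); // moved to a new position
            }
            if (seenAtPosition.contains(t.text)) {
                continue;
            }
            seenAtPosition.add(t.text);
            out.add(t);
        }
        return out;
    }

    // Runs already lowercased and folded words through the three steps.
    static List<String> analyze(List<String> words) {
        List<Tok> toks = new ArrayList<>();
        for (String w : words) {
            toks.add(new Tok(w, false, 1));
        }
        List<String> texts = new ArrayList<>();
        for (Tok t : removeDuplicates(truncate(keywordRepeat(toks), 5))) {
            texts.add(t.text);
        }
        return texts;
    }
}
```

In this simulation, the input words kitaplar and ev produce the terms kitaplar, kitap, ev: the long word is indexed both in full and as its 5-character prefix, while the short word is indexed once because its two copies are identical. If the truncate step ignored the keyword flag, both copies of kitaplar would be cut to kitap and the original form would be lost.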

I would like to get the community's opinions:

1) Is there any interest in this? 
2) Should the keyword attribute be respected? 
3) Package name: analysis.misc versus analysis.tr? 
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter?



