Sounds interesting.

+1 for FixedPrefixStemFilter.

Default prefixLength to 5.

-- Jack Krupansky

-----Original Message-----
From: Ahmet Arslan
Sent: Thursday, March 27, 2014 5:53 AM
To: [email protected]
Subject: TruncateTokenFilter FixedPrefixStemFilter

Hello,

I would like to ask whether there is interest in adding TruncateTokenFilter to Lucene.

I am using this filter as a stemmer for Turkish. In much academic research (clustering, classification, retrieval) it is used under the names Fixed Prefix Stemmer, Simple Truncation Method, or F5 for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in [1]. It is the same work from which some of the entries in stopwords_tr.txt were taken.

[1] "Information Retrieval on Turkish Texts"
http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf

ElasticSearch has this filter, but it does not respect the keyword attribute.
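To make the intended behaviour concrete, here is a minimal plain-Java sketch of fixed-prefix truncation that leaves keyword-flagged terms alone. It does not use the Lucene API: the class name FixedPrefixSketch, the truncate method, and the boolean keyword parameter are hypothetical stand-ins for a TokenFilter and its keyword attribute, and the Turkish sample words are only illustrative.

```java
// Hypothetical illustration only; not the proposed Lucene filter itself.
final class FixedPrefixSketch {

    /**
     * Returns the first prefixLength chars of term, unless the term is
     * flagged as a keyword or is already short enough.
     * Lengths are counted in Java chars (UTF-16 code units).
     */
    static String truncate(String term, int prefixLength, boolean keyword) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        if (keyword || term.length() <= prefixLength) {
            return term; // keyword terms and short terms pass through unchanged
        }
        return term.substring(0, prefixLength);
    }

    public static void main(String[] args) {
        System.out.println(truncate("kitaplar", 5, false)); // truncated to 5 chars
        System.out.println(truncate("kitaplar", 5, true));  // keyword: left as-is
        System.out.println(truncate("ev", 5, false));       // shorter than prefix: left as-is
    }
}
```

A real filter would read the flag from the token stream rather than take it as a parameter; the parameter is used here only to keep the sketch self-contained.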

The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding, so it works well together with ASCII folding [2].

[2] "Effects of diacritics on Turkish information retrieval"
http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf

Here is the full field type I use for customers:

<fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
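The last three filters in this chain interact, and a small plain-Java simulation may help show why the keyword attribute matters. The sketch below is an assumption-laden model, not Lucene code: F5ChainSketch, termsAtPosition, and the respectKeyword switch are hypothetical names, and the input is assumed to be already lowercased and ASCII-folded. It models the repeat step as emitting each token twice at the same position (once flagged as keyword, once not), then truncation, then removal of identical terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of repeat -> truncate -> remove-duplicates for one token.
final class F5ChainSketch {

    /** Terms that end up at one position for a single input token. */
    static List<String> termsAtPosition(String token, int prefixLength, boolean respectKeyword) {
        // Repeat step: the token appears twice, first as keyword, then as a normal token.
        boolean[] keywordFlags = {true, false};
        // Remove-duplicates step: a set keeps only distinct terms, in emission order.
        Set<String> unique = new LinkedHashSet<>();
        for (boolean keyword : keywordFlags) {
            boolean skip = respectKeyword && keyword;
            // Truncate step: keep the first prefixLength chars unless skipped or already short.
            String term = (skip || token.length() <= prefixLength)
                    ? token
                    : token.substring(0, prefixLength);
            unique.add(term);
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        System.out.println(termsAtPosition("kitaplar", 5, true));  // original and stem both kept
        System.out.println(termsAtPosition("kitaplar", 5, false)); // original form is lost
        System.out.println(termsAtPosition("kitap", 5, true));     // duplicate collapses to one term
    }
}
```

Under this model, a truncating filter that ignores the keyword flag shortens both copies, so the repeat step has no effect and the original form is never indexed.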

I would like to get the community's opinion on:

1) Is there interest in this? Should I create a JIRA issue and attach what I have?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter

Thanks,
Ahmet

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
