Sounds interesting.
+1 for FixedPrefixStemFilter.
Default prefixLength to 5.
-- Jack Krupansky
-----Original Message-----
From: Ahmet Arslan
Sent: Thursday, March 27, 2014 5:53 AM
To: [email protected]
Subject: TruncateTokenFilter FixedPrefixStemFilter
Hello,
I would like to ask whether there is interest in adding TruncateTokenFilter to
Lucene.
I am using this filter as a stemmer for Turkish. In much academic research
(clustering, classification, retrieval) it is used under the names
Fixed Prefix Stemmer, Simple Truncation Method, or F5 for short.
Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish
in [1]. It is the same work from which some of the entries in
stopwords_tr.txt were taken.
[1] "Information Retrieval on Turkish Texts"
http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf
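To make the idea concrete, here is a minimal plain-Java sketch of fixed-prefix truncation. It is only an illustration and not the filter itself: the class name FixedPrefixSketch and its methods are hypothetical, and it works on plain strings rather than on a Lucene token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of F5-style stemming on plain strings.
final class FixedPrefixSketch {

    // Keep at most prefixLength leading characters of a token.
    // Tokens that are already short enough are returned unchanged.
    static String truncate(String token, int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        return token.length() > prefixLength
                ? token.substring(0, prefixLength)
                : token;
    }

    // Apply the truncation to every token in a list.
    static List<String> stemAll(List<String> tokens, int prefixLength) {
        List<String> stems = new ArrayList<>();
        for (String token : tokens) {
            stems.add(truncate(token, prefixLength));
        }
        return stems;
    }

    public static void main(String[] args) {
        // "kitaplar" (books) and "kitap" (book) collapse to the same stem.
        System.out.println(stemAll(Arrays.asList("kitaplar", "kitap", "ev"), 5));
        // prints [kitap, kitap, ev]
    }
}
```

With length=5, inflected forms that share their first five characters end up as the same term, which is all the "stemming" this method does.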
ElasticSearch has this filter, but it does not respect the keyword attribute.
The main advantage of F5 stemming is that it is not affected by the meaning
loss caused by ASCII folding, so it works well together with ASCII folding.
[2] "Effects of diacritics on Turkish information retrieval"
http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf
Here is the full field type I use for customers:
<fieldType name="text_tr_ascii_f5" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
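The effect of the last three filters in this chain can be sketched in plain Java as follows. This is a simplified simulation under stated assumptions, not Lucene code: the class KeywordRepeatTruncateSketch is hypothetical, each inner list stands for the terms at one token position, and the truncation step is assumed to respect the keyword attribute (question 2 below).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of KeywordRepeat -> Truncate -> RemoveDuplicates.
final class KeywordRepeatTruncateSketch {

    // Returns, for each input token, the distinct terms left at its position.
    static List<List<String>> analyze(List<String> tokens, int prefixLength) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            // A LinkedHashSet drops the duplicate while keeping emit order.
            Set<String> atPosition = new LinkedHashSet<>();
            // Keyword copy: assumed to pass through truncation untouched.
            atPosition.add(token);
            // Non-keyword copy: truncated to the fixed prefix length.
            String stem = token.length() > prefixLength
                    ? token.substring(0, prefixLength)
                    : token;
            atPosition.add(stem);
            positions.add(new ArrayList<>(atPosition));
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(analyze(Arrays.asList("kitaplar", "kitap"), 5));
        // prints [[kitaplar, kitap], [kitap]]
    }
}
```

The original form is kept next to its stem, and a token that is already five characters or shorter is indexed only once.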
I would like to get community opinions on:
1) Is there interest in this? Should I create a JIRA issue and attach what I have?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter
Thanks,
Ahmet
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]