[
https://issues.apache.org/jira/browse/LUCENE-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956095#comment-13956095
]
ASF subversion and git services commented on LUCENE-5558:
---------------------------------------------------------
Commit 1583527 from [~rcmuir] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1583527 ]
LUCENE-5558: Add TruncateTokenFilter
> Add TruncateTokenFilter
> -----------------------
>
> Key: LUCENE-5558
> URL: https://issues.apache.org/jira/browse/LUCENE-5558
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.7
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Labels: Turkish, f5
> Fix For: 4.8
>
> Attachments: LUCENE-5558.patch, LUCENE-5558.patch, LUCENE-5558.patch,
> LUCENE-5558.patch
>
>
> I use this filter as a stemmer for Turkish. It appears in much academic
> research (classification, retrieval), where it is called the Fixed Prefix
> Stemmer, the Simple Truncation Method, or F5 for short.
> Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish
> in [Information Retrieval on Turkish
> Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf]. It is
> the same work from which most of stopwords_tr.txt was taken.
> ElasticSearch has a
> [truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html]
> filter, but it does not respect the keyword attribute. It also has a use case
> similar to TruncateFieldUpdateProcessorFactory.
> The main advantage of F5 stemming is that it is not affected by the meaning
> loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works
> well with ASCII folding. See [Effects of diacritics on Turkish information
> retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].
> Here is the full field type I use for "diacritics-insensitive search" in
> Turkish:
> {code:xml}
> <fieldType name="text_tr_ascii_f5" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ApostropheFilterFactory"/>
>     <filter class="solr.TurkishLowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.KeywordRepeatFilterFactory"/>
>     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> I would like to get the community's opinions:
> 1) Is there any interest in this?
> 2) Should the keyword attribute be respected?
> 3) Package name: analysis.misc versus analysis.tr
> 4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter
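
For readers skimming the quoted description, here is a minimal sketch of the fixed-prefix (F5) idea and of what the KeywordRepeat / Truncate / RemoveDuplicates part of the chain does to a single token. It is plain Java with no Lucene dependency, and it is not the code from LUCENE-5558.patch: the class name FixedPrefixSketch, the methods truncate and analyzeToken, and the boolean isKeyword flag (standing in for the keyword attribute) are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; not the implementation attached to the issue.
final class FixedPrefixSketch {

    // Keeps at most prefixLength UTF-16 code units of the term.
    // A term flagged as a keyword is returned unchanged, which is the
    // "respect the keyword attribute" behaviour asked about in the issue.
    static String truncate(String term, int prefixLength, boolean isKeyword) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be >= 1");
        }
        if (isKeyword || term.length() <= prefixLength) {
            return term;
        }
        return term.substring(0, prefixLength);
    }

    // Rough simulation of keyword-repeat, then truncate, then remove-duplicates
    // for one input token: the keyword copy is kept whole, the second copy is
    // truncated, and identical results collapse to a single entry.
    static List<String> analyzeToken(String token, int prefixLength) {
        LinkedHashSet<String> out = new LinkedHashSet<>();
        out.add(truncate(token, prefixLength, true));   // keyword copy
        out.add(truncate(token, prefixLength, false));  // truncated copy
        return new ArrayList<>(out);
    }
}
```

Under these assumptions, a long token yields both its original form and its five-character prefix, while a token of five characters or fewer yields a single entry because the two copies are identical. That is why the field type places RemoveDuplicatesTokenFilterFactory after the truncation step.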