[
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594154#comment-16594154
]
Ivan Provalov commented on LUCENE-7321:
---------------------------------------
[~arafalov], the clean use case for this filter is to externalize the
morphological modification rules. Most stemmers have hard-coded rules. With
this one, the rules are expressed in flat mapping files and configuration.
It was originally developed to cover a few cases for some of the languages
listed here and a few others, and to visualize the rules, which helps the
linguists involved in the project understand the modification rules in more
complex scenarios. I added the Russian stemmer implementation as a general
reference, just to show how one can configure an entire stemmer without
hard-coded rules. We have not seen any performance issues with this so far.
Hope this helps.
> Character Mapping
> -----------------
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
> Reporter: Ivan Provalov
> Priority: Minor
> Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing
> variant. These cases can be as simple as lower/upper case in most languages,
> accented characters, or more complex morphological phenomena like prefix
> omission, or constructing a character with a combining mark. This
> component addresses the cases that are not covered by the ASCII folding
> component, or that are more complex to design with other tools. The idea is
> that a linguist could provide the mappings in a tab-delimited file, which
> can then be used directly by Solr.
> The mappings are maintained in a tab-delimited file, which could simply be
> copied and pasted from an Excel spreadsheet. This lets the linguists create
> the mappings and the developer include them in the Solr configuration. In a
> few cases, when the mappings grow complex, some additional debugging may be
> required. A mapping can go from any sequence of characters to any other
> sequence of characters.
> Some of the cases I discuss in the detailed document are handling the voiced
> vowels for Japanese; common typing substitutions for Korean, Russian, Polish;
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding
> for Japanese. In the appendix, I give an example of implementing a Russian
> lightweight stemmer using this component.
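
To make the tab-delimited mapping idea concrete, here is a minimal sketch in
plain Java. It is not the code from the attached patch and does not use any
Lucene or Solr API. The class name CharMappingSketch, the method names
parseRules and apply, and the longest-match-first strategy are all assumptions
made for illustration; the actual component may resolve overlapping rules
differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the API of the LUCENE-7321 patch.
class CharMappingSketch {

    // Parse "source<TAB>target" lines. Lines without a tab, or with an
    // empty source, are skipped. An empty target deletes the source.
    static Map<String, String> parseRules(String tsv) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : tsv.split("\\R")) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue;
            }
            rules.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return rules;
    }

    // Assumed strategy: at each position, apply the longest matching
    // source sequence; if none matches, copy one character unchanged.
    static String apply(String input, Map<String, String> rules) {
        int maxLen = 0;
        for (String source : rules.keySet()) {
            maxLen = Math.max(maxLen, source.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            int longest = Math.min(maxLen, input.length() - i);
            for (int len = longest; len > 0; len--) {
                String target = rules.get(input.substring(i, i + len));
                if (target != null) {
                    out.append(target);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

With the two rules "sz<TAB>sh" and "s<TAB>z", apply("szs", rules) would rewrite
the leading "sz" as "sh" before the single-character rule is considered for
the trailing "s". Keeping the rules in a map loaded from text is what lets a
linguist edit them without touching the code.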