Ivan Provalov created LUCENE-7321:
-------------------------------------
Summary: Character Mapping
Key: LUCENE-7321
URL: https://issues.apache.org/jira/browse/LUCENE-7321
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 6.0.1, 5.4.1, 6.0, 4.6.1
Reporter: Ivan Provalov
Priority: Minor
Fix For: 6.0.1
One of the challenges in search is recall of an item that has a common typing
variant. These cases can be as simple as lower/upper case in most languages or
accented characters, or they can involve more complex morphological phenomena
such as prefix omission or constructing a character with a combining mark.
This component addresses the cases that are not covered by the ASCII folding
component or that are more complex to design with other tools. The idea is
that a linguist provides the mappings in a tab-delimited file, which can be as
simple as a copy and paste from an Excel spreadsheet, and a developer then
includes that file in the Solr configuration, where it is used directly. In a
few cases, when the mappings grow complex, some additional debugging may be
required. A mapping can take any sequence of characters to any other sequence
of characters.
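
To make the file format concrete, here is a minimal sketch in Python of how
a tab-delimited mapping file could be parsed and applied. It is not the
proposed component and does not use any Lucene or Solr API. The function
names, the sample mappings, and the rule that the longest matching source
sequence wins are all assumptions made for illustration.

```python
# Illustrative sketch only: parse_mappings and apply_mappings are
# hypothetical names, not part of Lucene, Solr, or the proposed component.

def parse_mappings(text):
    """Parse 'source<TAB>target' lines into a dict, skipping blank lines."""
    mappings = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        source, sep, target = line.partition("\t")
        if not sep or not source:
            raise ValueError(f"expected 'source<TAB>target', got {line!r}")
        # An empty target is allowed and deletes the source sequence.
        mappings[source] = target
    return mappings


def apply_mappings(text, mappings):
    """Rewrite text left to right, replacing mapped character sequences."""
    # Assumption: when several sources match at one position,
    # the longest one wins.
    sources = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for source in sources:
            if text.startswith(source, i):
                out.append(mappings[source])
                i += len(source)
                break
        else:
            # No mapping starts here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Sample file content (illustrative, not taken from the detailed document):
#   'e' + combining acute accent  ->  precomposed 'é'
#   Cyrillic 'ё'                  ->  Cyrillic 'е'
SAMPLE = "e\u0301\t\u00e9\n\u0451\t\u0435\n"

if __name__ == "__main__":
    table = parse_mappings(SAMPLE)
    print(apply_mappings("cafe\u0301", table))
    print(apply_mappings("\u0451\u043b\u043a\u0430", table))
```

Because sources and targets are arbitrary strings, the same table can hold
one-to-one, many-to-one, and one-to-many mappings. Applying the same table at
index time and at query time is what would let typing variants match.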
Some of the cases I discuss in the detailed document are handling the voiced
vowels for Japanese; common typing substitutions for Korean, Russian, and
Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and
suffix folding for Japanese.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)