George Rhoten created LUCENE-5224:
-------------------------------------
Summary: org.apache.lucene.analysis.hunspell.HunspellDictionary
should implement ICONV and OCONV lines in the affix file
Key: LUCENE-5224
URL: https://issues.apache.org/jira/browse/LUCENE-5224
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 4.4, 4.0
Reporter: George Rhoten
Some Hunspell dictionaries need to emulate Unicode normalization and collation
in order to produce the correct stem of a word. The original Hunspell supports
this through the ICONV and OCONV lines in the affix file. Lucene's
HunspellDictionary currently ignores these lines.
Please support these keys in the affix file.
This bit of functionality is briefly described in the hunspell man page
http://manpages.ubuntu.com/manpages/lucid/man4/hunspell.4.html
This functionality is practically required in order to use a Korean dictionary,
because stemming needs to operate on only some of the Jamos of a Hangul
character (grapheme cluster). Other languages would benefit from it as well.
Here is an example for a .aff file:
{code}
ICONV 각 각
...
OCONV 각 각
{code}
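Here is the same example escaped. ICONV maps the precomposed syllable U+AC01 to
its conjoining Jamo sequence, and OCONV maps that sequence back.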
{code}
ICONV \uAC01 \u1100\u1161\u11A8
...
OCONV \u1100\u1161\u11A8 \uAC01
{code}
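To make the request concrete, here is a minimal sketch of how such conversion tables could be parsed and applied. It is not existing Lucene code: the class name ConversionTable and its methods parse and apply are hypothetical, and trying the longest pattern first at each position is an assumption about matching order. It handles only plain string-to-string mappings and skips the optional count line (for example "ICONV 1").

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): holds the mappings read from ICONV or OCONV lines. */
final class ConversionTable {
    // Maps each pattern to its replacement.
    private final Map<String, String> mappings = new HashMap<>();
    // Length of the longest pattern, used to bound the search in apply().
    private int longestPattern = 0;

    /**
     * Collects "KEY pattern replacement" lines for the given key (ICONV or OCONV).
     * A "KEY count" header line has only two fields and is skipped.
     */
    static ConversionTable parse(List<String> affixLines, String key) {
        ConversionTable table = new ConversionTable();
        for (String line : affixLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3 && fields[0].equals(key)) {
                table.mappings.put(fields[1], fields[2]);
                table.longestPattern = Math.max(table.longestPattern, fields[1].length());
            }
        }
        return table;
    }

    /** Rewrites the input left to right, trying the longest candidate pattern first at each position. */
    String apply(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            int maxLen = Math.min(longestPattern, input.length() - i);
            for (int len = maxLen; len > 0; len--) {
                String replacement = mappings.get(input.substring(i, i + len));
                if (replacement != null) {
                    out.append(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No pattern starts here: copy the character unchanged.
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, the ICONV table would be applied to a token before the stemmer looks it up, and the OCONV table would be applied to each resulting stem before it is returned.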