[ https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071 ]
Walter Underwood commented on SOLR-815:
---------------------------------------
I looked it up, and even found a reason to do it the right way.
Latin should be normalized to halfwidth (in the Latin-1 character space).
Kana should be normalized to fullwidth.
Normalizing Latin characters to fullwidth would mean you could not use the
existing accent-stripping filters or probably any other filter that expected
Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and
Lucene work as expected.
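As a rough illustration of the two directions above (not the code in the attached patch), Unicode NFKC normalization from the JDK happens to fold fullwidth Latin down to ASCII and halfwidth Kana up to the standard Kana. The class and method names here are made up for the example, and NFKC also folds many other compatibility characters, so treat this as a sketch only:

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Solr or Lucene.
class WidthNormalizationSketch {

    // NFKC maps fullwidth Latin (U+FF01..U+FF5E) to ASCII and
    // halfwidth Katakana (U+FF65..U+FF9F) to the standard Katakana block.
    // Caveat: NFKC also rewrites ligatures, superscripts, and other
    // compatibility characters outside the width forms.
    static String normalizeWidth(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }
}
```

For example, normalizeWidth("\uFF33\uFF4F\uFF4C\uFF52") (fullwidth "Solr") yields plain ASCII "Solr", which is what an accent-stripping or synonym filter further down the chain would expect to see.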
See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf
The compatibility forms (the ones we normalize away from) are in the Unicode
range U+FF00 to U+FFEF.
The correct mappings from those forms are in this doc:
http://www.unicode.org/charts/PDF/UFF00.pdf
Other charts are here: http://www.unicode.org/charts/
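One way to limit the folding to just that U+FF00 to U+FFEF block, sketched here under the assumption that the JDK's NFKC tables agree with the chart above (the class and method names are hypothetical, and this is not the attached patch): normalize only contiguous runs of characters from the block and copy everything else through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Solr or Lucene.
class HalfwidthFullwidthFolder {

    // True for characters in the Halfwidth and Fullwidth Forms block.
    static boolean inCompatibilityBlock(char c) {
        return c >= '\uFF00' && c <= '\uFFEF';
    }

    // Applies NFKC only to runs of characters inside the block, so that a
    // halfwidth Kana followed by a halfwidth voiced sound mark can compose
    // into a single Kana, while text outside the block is left untouched.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            if (!inCompatibilityBlock(input.charAt(i))) {
                out.append(input.charAt(i));
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && inCompatibilityBlock(input.charAt(i))) {
                i++;
            }
            String run = input.substring(start, i);
            out.append(Normalizer.normalize(run, Normalizer.Form.NFKC));
        }
        return out.toString();
    }
}
```

With this, fullwidth "ABC" becomes ASCII "ABC" and halfwidth KA plus the halfwidth voiced mark becomes the single character GA (U+30AC), while something like the "fi" ligature (U+FB01) passes through as-is. A halfwidth voiced mark that directly follows a Kana from outside the block is not composed with it; handling that case is left out of the sketch.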
> Add new Japanese half-width/full-width normalization Filter and Factory
> -----------------------------------------------------------------------
>
> Key: SOLR-815
> URL: https://issues.apache.org/jira/browse/SOLR-815
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Todd Feak
> Assignee: Koji Sekiguchi
> Priority: Minor
> Attachments: SOLR-815.patch
>
>
> Japanese Katakana and Latin alphabet characters exist as both a "half-width"
> and "full-width" version. This new Filter normalizes to the full-width
> version to allow searching and indexing using both.