[
https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785341#action_12785341
]
Robert Muir commented on LUCENE-2105:
-------------------------------------
Right, there is a Filter in LUCENE-1488 for efficient Unicode normalization. It
implements .quickCheck() and works on char[].
The only other alternative is the JDK 6 implementation, which would be a lot
less efficient: it is String-based and offers only .isNormalized(), with no
.quickCheck().
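For reference, here is a minimal sketch of the JDK 6 route mentioned above, using java.text.Normalizer. It is not the NormalizationFilter from LUCENE-1488; the class name NormalizerSketch and the helper toNfkc are hypothetical and exist only for illustration. It shows the String-based API, where isNormalized() is the only available pre-check:

```java
import java.text.Normalizer;

// Illustrative only: the class and method names are hypothetical, not Lucene API.
class NormalizerSketch {

    // Returns the NFKC form of a term using the JDK's String/CharSequence API.
    // isNormalized() is the only pre-check the JDK offers here.
    static String toNfkc(String term) {
        if (Normalizer.isNormalized(term, Normalizer.Form.NFKC)) {
            return term; // already normalized, nothing to do
        }
        return Normalizer.normalize(term, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "undefined" spelled with the single ligature character U+FB01
        String withLigature = "unde\uFB01ned";
        System.out.println(toNfkc(withLigature).equals("undefined")); // prints "true"
    }
}
```

Because this API works on String rather than char[], a token filter built on it would have to convert each term buffer to a String first, which is part of the inefficiency noted above.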
If people want me to break up LUCENE-1488 into smaller pieces and do them one
piece at a time, we could go this route, because the NormalizationFilter there
is, IMHO, very clear and efficient, and will not change.
On the other hand, I like the idea of solving that issue consistently as a
whole, since normalization interacts with other processes such as case folding.
> Lucene does not support Unicode Normalization Forms
> ---------------------------------------------------
>
> Key: LUCENE-2105
> URL: https://issues.apache.org/jira/browse/LUCENE-2105
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Alexander Veit
>
> Lucene should bring terms into a Unicode normalization form
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "ﬁ" (U+FB01), "ﬂ"
> (U+FB02), "ﬀ" (U+FB00), or "ﬄ" (U+FB04) cannot be found in certain documents
> (try to find "undefined" in
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf).