[ https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785341#action_12785341 ]
Robert Muir commented on LUCENE-2105:
-------------------------------------

Right, there is a Filter in LUCENE-1488 for efficient Unicode normalization. It implements .quickCheck() and works on char[].

The only other alternative is the JDK6 impl, which would be a lot less efficient: it is String-based and offers only .isNormalized(), with no .quickCheck().

If people want me to break up LUCENE-1488 into smaller pieces and do them one piece at a time, we could go this route, because the NormalizationFilter there is IMHO very clear, efficient, and will not change.

On the other hand, I like the idea of consistency in solving that issue as a whole, as normalization interacts with other processes such as case folding.

> Lucene does not support Unicode Normalization Forms
> ---------------------------------------------------
>
>                 Key: LUCENE-2105
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2105
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Alexander Veit
>
> Lucene should bring terms in their Unicode normalization form
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "fi", "fl", "ff", or
> "ffl" cannot be found in certain documents (try to find "undefined" in
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf).
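
For illustration only, here is a minimal sketch of the JDK6 alternative mentioned in the comment, applied to the ligature case from the issue description. It uses java.text.Normalizer, which exposes only isNormalized() and normalize() and returns a String. The class name NfkcLigatureDemo and the helper toNfkc are hypothetical names chosen for this example; this is not the NormalizationFilter from LUCENE-1488, and it says nothing about how that filter is implemented.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene or of the LUCENE-1488 patch.
final class NfkcLigatureDemo {

    // Returns the NFKC form of a term using the JDK's String-based API.
    // isNormalized() is a full yes/no check; the JDK exposes no
    // quickCheck()-style method with a "maybe" answer.
    static String toNfkc(String term) {
        if (Normalizer.isNormalized(term, Normalizer.Form.NFKC)) {
            return term; // already normalized, nothing to do
        }
        return Normalizer.normalize(term, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "undefined" as it may appear in a PDF: U+FB01 is the LATIN SMALL LIGATURE FI.
        String indexed = "unde\uFB01ned";
        String query = "undefined";

        // Without normalization the two terms differ, so the query misses.
        System.out.println(indexed.equals(query));         // prints "false"

        // NFKC replaces the ligature with the two letters "fi".
        System.out.println(toNfkc(indexed).equals(query)); // prints "true"
    }
}
```

Each call here allocates a new String whenever the input is not already normalized, which is the kind of per-term overhead the comment contrasts with a filter that works directly on char[].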