[ https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785302#action_12785302 ]
DM Smith commented on LUCENE-2105:
----------------------------------

Is this a duplicate of, or solved by, LUCENE-1488? It provides an ICUNormalizationFilter that looks like it will do the trick.

The only problem with relying on LUCENE-1488 is that the issue would not be solved in core, or without a third-party library.

I may be wrong, but as I understand it, complete Unicode normalization is the responsibility of the user of Lucene. As pointed out in the JavaDoc for ICUNormalizationFilter, sometimes it needs to be placed within the chain of filters and not merely before tokenization.

> Lucene does not support Unicode Normalization Forms
> ---------------------------------------------------
>
>                 Key: LUCENE-2105
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2105
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Alexander Veit
>
> Lucene should bring terms in their Unicode normalization form
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "fi", "fl", "ff", or
> "ffl" cannot be found in certain documents (try to find "undefined" in
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf).
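
For illustration, the ligature case described in the issue can be reproduced with the JDK alone. The following is a minimal sketch, assuming java.text.Normalizer (available since Java 6) rather than the ICUNormalizationFilter from LUCENE-1488. The class and method names are hypothetical and not part of Lucene.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: shows what NFKC does to ligatures.
final class NfkcDemo {

    // Returns the NFKC form of the input (compatibility decomposition
    // followed by canonical composition).
    static String normalizeNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "undefined" spelled with the single-character ligature U+FB01
        String withLigature = "unde\uFB01ned";

        // Without normalization the indexed text and the query text differ
        System.out.println(withLigature.equals("undefined"));                // false

        // After NFKC the ligature is folded to the two letters 'f' and 'i'
        System.out.println(normalizeNfkc(withLigature).equals("undefined")); // true
    }
}
```

In an analysis chain the same call would have to be applied to the term text by a filter, at whatever position the other filters in the chain require. This sketch only shows the string-level effect and does not address where in the chain normalization belongs.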