[jira] Commented: (LUCENE-2105) Lucene does not support Unicode Normalization Forms

DM Smith (JIRA) Thu, 03 Dec 2009 05:14:57 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785302#action_12785302 ]


DM Smith commented on LUCENE-2105:
----------------------------------

Is this a duplicate or solved by LUCENE-1488? It provides for an 
ICUNormalizationFilter that looks like it will do the trick.

The only drawback of relying on LUCENE-1488 is that the problem would not be 
solved in core, nor without a third-party library.

I may be wrong, but as I understand it, complete Unicode normalization is the 
responsibility of the user of Lucene. As pointed out in the JavaDoc for 
ICUNormalizationFilter, sometimes it needs to be placed within the chain of 
filters and not merely before tokenization.

> Lucene does not support Unicode Normalization Forms
> ---------------------------------------------------
>
>                 Key: LUCENE-2105
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2105
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Alexander Veit
>
> Lucene should bring terms in their Unicode normalization form 
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "ﬁ", "ﬂ", "ﬀ", or 
> "ffl" cannot be found in certain documents (try to find "undefined" in 
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf). 
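
To make the ligature problem concrete, here is a minimal sketch using the JDK's own java.text.Normalizer (available since Java 6), not anything from Lucene or from LUCENE-1488. The class name NfkcSketch and the helper toNfkc are made up for illustration, and the sample string assumes the PDF text contains the single code point U+FB01 (LATIN SMALL LIGATURE FI) where a reader sees "fi". It only shows normalization applied to a whole string before indexing; it does not cover the case where normalization has to happen inside the filter chain.

```java
import java.text.Normalizer;

public class NfkcSketch {

    // Hypothetical helper: apply NFKC (compatibility decomposition followed
    // by canonical composition) to text before it is indexed or searched.
    static String toNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "undefined" as it might be extracted from a PDF: the "fi" pair is
        // the single code point U+FB01 (assumption for this example).
        String extracted = "unde\uFB01ned";

        // A plain ASCII query does not match the ligature form.
        System.out.println(extracted.contains("undefined"));          // false

        // After NFKC the ligature is expanded to "f" + "i" and the query matches.
        System.out.println(toNfkc(extracted).contains("undefined"));  // true

        // NFC alone leaves the ligature untouched, because U+FB01 has only a
        // compatibility mapping, not a canonical one.
        String nfc = Normalizer.normalize(extracted, Normalizer.Form.NFC);
        System.out.println(nfc.equals(extracted));                    // true
    }
}
```

The same normalization would have to be applied to query text as well as to indexed text, otherwise the two sides can still disagree.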
