[jira] Commented: (LUCENE-2105) Lucene does not support Unicode Normalization Forms

Robert Muir (JIRA) Thu, 03 Dec 2009 06:31:47 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785341#action_12785341 ]


Robert Muir commented on LUCENE-2105:
-------------------------------------

Right, there is a Filter in LUCENE-1488 for efficient Unicode normalization. It 
implements .quickCheck() and works on char[].

The only other alternative is the JDK 6 implementation (java.text.Normalizer), 
which would be a lot less efficient: it is String-based and offers only 
.isNormalized(), with no .quickCheck().
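
To make the JDK 6 option concrete, here is a minimal sketch of check-then-normalize 
using java.text.Normalizer. The class and method names are hypothetical and are not 
taken from LUCENE-1488; NFKC is assumed only because the issue description suggests it.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical helper, not part of Lucene or of the LUCENE-1488 patch.
final class JdkNormalizationSketch {

    // Returns the term unchanged if it is already in NFKC; otherwise returns
    // a newly allocated, normalized String. The JDK API answers only yes/no
    // via isNormalized() and always produces a String from normalize().
    static String normalizeIfNeeded(String term) {
        if (Normalizer.isNormalized(term, Form.NFKC)) {
            return term;
        }
        return Normalizer.normalize(term, Form.NFKC);
    }

    public static void main(String[] args) {
        // "unde\uFB01ned" contains U+FB01 LATIN SMALL LIGATURE FI.
        System.out.println(normalizeIfNeeded("unde\uFB01ned")); // prints "undefined"
    }
}
```

Note that a token filter holding its term in a char[] would have to convert to a 
String (or another CharSequence) and back to use this API, which is the overhead 
referred to above.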

If people want me to break up LUCENE-1488 into smaller pieces and do them one 
piece at a time, we could go this route, because the NormalizationFilter there 
is, IMHO, very clear and efficient, and will not change.

On the other hand, I like the idea of consistency in solving that issue as a 
whole, as Normalization interacts with other processes such as Case Folding.


> Lucene does not support Unicode Normalization Forms
> ---------------------------------------------------
>
>                 Key: LUCENE-2105
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2105
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Alexander Veit
>
> Lucene should bring terms in their Unicode normalization form 
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "ﬁ", "ﬂ", "ﬀ", or 
> "ﬄ" cannot be found in certain documents (try to find "undefined" in 
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf). 
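
The search miss described in the issue can be reproduced with a short sketch. It 
assumes the extracted document text contains U+FB01 (the "fi" ligature) where the 
query has the two letters "f" and "i"; the class and method names are hypothetical, 
and java.text.Normalizer merely stands in for whatever normalization Lucene might apply.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical illustration of the reported problem, not Lucene code.
final class LigatureSearchSketch {

    // Applies NFKC to both sides before a plain substring match.
    static boolean containsAfterNfkc(String text, String query) {
        String normalizedText = Normalizer.normalize(text, Form.NFKC);
        String normalizedQuery = Normalizer.normalize(query, Form.NFKC);
        return normalizedText.contains(normalizedQuery);
    }

    public static void main(String[] args) {
        // Text as it might be extracted from a PDF that uses the ligature glyph.
        String extracted = "the behavior is unde\uFB01ned";
        System.out.println(extracted.contains("undefined"));            // raw match
        System.out.println(containsAfterNfkc(extracted, "undefined"));  // match after NFKC
    }
}
```

The raw match fails because U+FB01 is a single code point that is not equal to the 
sequence "fi"; NFKC replaces the ligature with its compatibility decomposition, so 
both sides compare equal afterwards.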

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


