[lucy-issues] [jira] [Commented] (LUCY-191) Unicode normalization

Nick Wellnhofer (Commented) (JIRA) Tue, 22 Nov 2011 08:47:05 -0800

    [ 
https://issues.apache.org/jira/browse/LUCY-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13155247#comment-13155247
 ]


Nick Wellnhofer commented on LUCY-191:
--------------------------------------

I have some upcoming changes to my patch. They fix a bug involving the return 
type of utf8proc_map and clean up the coding style and the boolean literals.

Regarding memory allocation: If we don't call utf8proc_map directly, we can use 
our own allocation routines and also implement some optimizations. I'd propose 
a fixed-size work buffer that is large enough to hold the intermediate results 
for input strings up to a certain length. This should save one allocation for 
most tokens. If the normalized result isn't larger than the input, we can also 
copy it back into the original token, avoiding another reallocation.
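
A rough sketch of how the work buffer idea could look, to make the two savings 
concrete. Everything here is an assumption for illustration: the Token struct, 
the WORK_BUF_CODEPOINTS threshold, and the toy_decompose/toy_reencode helpers 
are hypothetical stand-ins and not part of Lucy or utf8proc. The toy helpers 
only lowercase ASCII and expand '&' to "and" so that the growth path can be 
exercised without a real normalizer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: tokens needing at most this many code points
 * are handled without allocating a work buffer. */
#define WORK_BUF_CODEPOINTS 64

/* Hypothetical token: heap-owned text, not NUL-terminated. */
typedef struct {
    char  *text;
    size_t len;
} Token;

/* Toy stand-in for the decomposition step. Lowercases ASCII and expands
 * '&' to "and". Returns the number of code points required; if that
 * exceeds cap, the contents of out are incomplete and must be ignored. */
static size_t toy_decompose(const char *in, size_t len,
                            int32_t *out, size_t cap) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c == '&') {
            if (n + 3 <= cap) {
                out[n] = 'a'; out[n + 1] = 'n'; out[n + 2] = 'd';
            }
            n += 3;
        } else {
            if (n < cap) {
                out[n] = (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
            }
            n += 1;
        }
    }
    return n;
}

/* Toy stand-in for re-encoding: one byte per code point (ASCII only). */
static size_t toy_reencode(const int32_t *cps, size_t n, char *out) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (char)cps[i];
    }
    return n;
}

/* Normalizes tok in place. Returns 0 on success, -1 if out of memory. */
static int normalize_token(Token *tok) {
    int32_t  stack_buf[WORK_BUF_CODEPOINTS];
    int32_t *work = stack_buf;
    size_t   cap  = WORK_BUF_CODEPOINTS;
    int      rc   = 0;

    assert(tok != NULL && tok->text != NULL);

    /* First try the fixed-size buffer; fall back to the heap only when
     * the intermediate result does not fit. */
    size_t needed = toy_decompose(tok->text, tok->len, work, cap);
    if (needed > cap) {
        work = malloc(needed * sizeof *work);
        if (work == NULL) return -1;
        cap    = needed;
        needed = toy_decompose(tok->text, tok->len, work, cap);
    }

    /* Grow the token only if the result is larger than the input;
     * otherwise the existing storage is reused as-is. */
    size_t out_len = needed;  /* ASCII-only toy: one byte per code point */
    if (out_len > tok->len) {
        char *grown = realloc(tok->text, out_len);
        if (grown == NULL) rc = -1;
        else tok->text = grown;
    }
    if (rc == 0) {
        tok->len = toy_reencode(work, needed, tok->text);
    }

    if (work != stack_buf) free(work);
    return rc;
}
```

With this shape, a short token whose normalized form is no longer than the 
input goes through without any allocation. The heap is touched only for long 
inputs or for results that outgrow the token's existing storage.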
                
> Unicode normalization
> ---------------------
>
>                 Key: LUCY-191
>                 URL: https://issues.apache.org/jira/browse/LUCY-191
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Nick Wellnhofer
>            Assignee: Marvin Humphrey
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.3.0 (incubating)
>
>         Attachments: LUCY-191-normalizer.patch
>
>
> As discussed on the mailing list, it would be nice to have Unicode 
> normalization, Unicode case folding and stripping of accents as part of the 
> analyzer chain. With the help of utf8proc this can be done in one pass. So I 
> proposed a new analyzer Lucy::Analyzer::Normalizer with an interface 
> described here:
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/%3C4EC43816.1070107%40aevum.de%3E
