[ https://issues.apache.org/jira/browse/LUCY-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13155247#comment-13155247 ]
Nick Wellnhofer commented on LUCY-191:
--------------------------------------
I have some upcoming changes to my patch that fix a bug regarding the
return type of utf8proc_map and address coding style and the boolean literals.

Regarding memory allocation: If we don't call utf8proc_map directly, we can use
our own allocation routines and also implement some optimizations. I'd propose
a fixed-size work buffer that is large enough to hold the intermediate results
for input strings up to a certain length. This should save one allocation for
most tokens. If the normalized result isn't larger than the input, we can also
copy it back into the original token's buffer, avoiding another reallocation.
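
A rough sketch of the buffer strategy described above, not code from the
patch. The Token struct, WORK_BUF_SIZE, toy_normalize and normalize_token are
all hypothetical names invented for illustration. toy_normalize is only a
stand-in for the real normalization step (it lowercases ASCII and expands '&'
to "and" so that output can outgrow input); an actual implementation would
replace it with whatever lower-level utf8proc calls the patch ends up using.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: inputs whose worst-case output fits in this many
 * bytes are processed without allocating a work buffer. */
#define WORK_BUF_SIZE 64

/* Hypothetical token: a heap-allocated, non-terminated byte string. */
typedef struct {
    char   *text;
    size_t  len;
} Token;

/* Stand-in for the real normalization (assumption, not utf8proc).
 * Lowercases ASCII letters and expands '&' to "and". Returns the number of
 * bytes the result needs; bytes beyond `cap` are not written. */
static size_t
toy_normalize(const char *in, size_t in_len, char *out, size_t cap) {
    size_t needed = 0;
    for (size_t i = 0; i < in_len; i++) {
        char        single  = in[i];
        const char *rep     = &single;
        size_t      rep_len = 1;
        if (single == '&') {
            rep     = "and";
            rep_len = 3;
        }
        else if (single >= 'A' && single <= 'Z') {
            single = (char)(single - 'A' + 'a');
        }
        if (needed + rep_len <= cap) {
            memcpy(out + needed, rep, rep_len);
        }
        needed += rep_len;
    }
    return needed;
}

/* Normalizes a token in place. Returns 0 on success, -1 on allocation
 * failure (the token is left unchanged in that case). */
static int
normalize_token(Token *tok) {
    assert(tok != NULL);

    char    fixed[WORK_BUF_SIZE];
    char   *work     = fixed;
    size_t  work_cap = tok->len * 3;  /* worst case for toy_normalize */

    /* Only long inputs pay for a heap-allocated work buffer. */
    if (work_cap > sizeof(fixed)) {
        work = malloc(work_cap);
        if (work == NULL) { return -1; }
    }

    size_t out_len = toy_normalize(tok->text, tok->len, work, work_cap);

    /* Grow the token only if the result is larger than the input;
     * otherwise the existing buffer is reused. */
    if (out_len > tok->len) {
        char *grown = realloc(tok->text, out_len);
        if (grown == NULL) {
            if (work != fixed) { free(work); }
            return -1;
        }
        tok->text = grown;
    }
    if (out_len > 0) {
        memcpy(tok->text, work, out_len);
    }
    tok->len = out_len;

    if (work != fixed) { free(work); }
    return 0;
}
```

With this layout, a short token whose normalized form is no longer than the
original goes through without any allocation at all. Only long inputs, or
results that outgrow the token, touch the allocator.
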
> Unicode normalization
> ---------------------
>
> Key: LUCY-191
> URL: https://issues.apache.org/jira/browse/LUCY-191
> Project: Lucy
> Issue Type: New Feature
> Components: Analysis
> Reporter: Nick Wellnhofer
> Assignee: Marvin Humphrey
> Priority: Minor
> Labels: patch
> Fix For: 0.3.0 (incubating)
>
> Attachments: LUCY-191-normalizer.patch
>
>
> As discussed on the mailing list, it would be nice to have Unicode
> normalization, Unicode case folding and stripping of accents as part of the
> analyzer chain. With the help of utf8proc this can be done in one pass. So I
> proposed a new analyzer Lucy::Analyzer::Normalizer with an interface
> described here:
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/%3C4EC43816.1070107%40aevum.de%3E