[
https://issues.apache.org/jira/browse/LUCY-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154693#comment-13154693
]
Marvin Humphrey commented on LUCY-191:
--------------------------------------
Great job, Nick! You really nailed it with this contribution.
* Public API and rough implementation design match what you outlined and
built consensus for on the dev list.
* Proper documentation.
* Builds clean.
* Accompanied by tests and passes them.
* Passes test_valgrind.
* Looks portable.
I have a handful of minor suggestions, but as none of them are crucial, +1 to
commit verbatim.
Here are two things that I'd like to discuss on the dev list:
* Memory allocated with malloc() within utf8proc should not necessarily be
freed with FREEMEM (which is an alias for lucy_Memory_wrapped_free).
This happens to be safe right now, but that's an implementation detail of
Lucy::Util::Memory.
* The fact that utf8proc forces us to reallocate with each operation rather
than copying when possible as in SnowballStemmer is probably not optimal
from a performance standpoint.
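To make the first point concrete, here is a minimal sketch in plain C of
pairing a library's malloc() with free() while copying the result into a
buffer the caller owns and reuses. Everything in it is hypothetical:
lib_transform() is a stand-in for a utf8proc-style call that returns a
malloc()'d result (it only lowercases ASCII), and ReusableBuf stands in for
whatever buffer the host project manages with its own allocator. It does not
remove the allocation inside the library call itself, so it addresses the
allocator mismatch rather than the performance concern.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a library call that hands back a malloc()'d
 * buffer. It lowercases ASCII letters only. */
static char *lib_transform(const char *src, size_t len, size_t *out_len) {
    char *result = (char *)malloc(len + 1);
    if (result == NULL) { return NULL; }
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        result[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    result[len] = '\0';
    *out_len = len;
    return result;
}

/* Hypothetical reusable buffer owned by the caller. A real project would
 * grow it with its own allocator; realloc() keeps this sketch standalone. */
typedef struct {
    char  *ptr;
    size_t size;
    size_t cap;
} ReusableBuf;

/* Copy len bytes into buf, growing it only when the capacity is too small. */
static int ReusableBuf_assign(ReusableBuf *buf, const char *src, size_t len) {
    assert(buf != NULL);
    if (len + 1 > buf->cap) {
        char *grown = (char *)realloc(buf->ptr, len + 1);
        if (grown == NULL) { return -1; }
        buf->ptr = grown;
        buf->cap = len + 1;
    }
    memcpy(buf->ptr, src, len);
    buf->ptr[len] = '\0';
    buf->size = len;
    return 0;
}

/* Transform src, copy the result into buf, and release the library's
 * buffer with free(), the counterpart of the malloc() that produced it. */
static int normalize_into(ReusableBuf *buf, const char *src, size_t len) {
    size_t out_len = 0;
    char *tmp = lib_transform(src, len, &out_len);
    if (tmp == NULL) { return -1; }
    int status = ReusableBuf_assign(buf, tmp, out_len);
    free(tmp);
    return status;
}
```

With this shape, the caller's buffer never receives a pointer from a foreign
allocator, so it stays correct even if the project's free wrapper stops being
a thin layer over free().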
Here are two tiny details:
* We can simplify the dump/load routines if we cache "form" within the
object as a member variable.
* The keywords "true" and "false" are available in Clownfish, and I think we
should use those as the defaults in the method signature for the boolean
args.
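As a rough illustration of the first detail, the sketch below caches the
form name in the object so that dumping is a straight read of member
variables rather than a reverse mapping from option bits. It is plain C, not
Clownfish, and every name in it is hypothetical: the Normalizer struct, the
OPT_* bits (stand-ins, not utf8proc's actual constants), and the one-line
text dump format. C has no default arguments, so the point about "true" and
"false" defaults is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option bits; stand-ins for the library's real flags. */
enum { OPT_COMPOSE = 1 << 0, OPT_DECOMPOSE = 1 << 1, OPT_COMPAT = 1 << 2 };

typedef struct {
    char form[8];        /* cached form name: "NFC", "NFD", "NFKC", "NFKD" */
    bool case_fold;
    bool strip_accents;
    int  options;        /* derived once from form at init time */
} Normalizer;

/* Validate the form name, derive the option bits, and cache the name. */
static int Normalizer_init(Normalizer *self, const char *form,
                           bool case_fold, bool strip_accents) {
    assert(self != NULL && form != NULL);
    int options;
    if      (strcmp(form, "NFC")  == 0) { options = OPT_COMPOSE; }
    else if (strcmp(form, "NFD")  == 0) { options = OPT_DECOMPOSE; }
    else if (strcmp(form, "NFKC") == 0) { options = OPT_COMPOSE | OPT_COMPAT; }
    else if (strcmp(form, "NFKD") == 0) { options = OPT_DECOMPOSE | OPT_COMPAT; }
    else { return -1; }
    strcpy(self->form, form);   /* at most 4 chars plus NUL, fits in 8 */
    self->case_fold     = case_fold;
    self->strip_accents = strip_accents;
    self->options       = options;
    return 0;
}

/* Dump reads the cached members directly; no bit decoding is needed. */
static int Normalizer_dump(const Normalizer *self, char *out, size_t out_size) {
    return snprintf(out, out_size, "form=%s case_fold=%s strip_accents=%s",
                    self->form,
                    self->case_fold ? "true" : "false",
                    self->strip_accents ? "true" : "false");
}

/* Load parses the three fields and reuses the init routine. */
static int Normalizer_load(Normalizer *self, const char *dump) {
    char form[8], cf[6], sa[6];
    if (sscanf(dump, "form=%7s case_fold=%5s strip_accents=%5s",
               form, cf, sa) != 3) {
        return -1;
    }
    return Normalizer_init(self, form,
                           strcmp(cf, "true") == 0, strcmp(sa, "true") == 0);
}
```

Because load goes back through the init routine, the option bits are always
rebuilt from the cached form and cannot drift out of sync with it.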
That's all I've got right now! Nice work figuring out this patch with minimal
help.
> Unicode normalization
> ---------------------
>
> Key: LUCY-191
> URL: https://issues.apache.org/jira/browse/LUCY-191
> Project: Lucy
> Issue Type: New Feature
> Components: Analysis
> Reporter: Nick Wellnhofer
> Assignee: Marvin Humphrey
> Priority: Minor
> Labels: patch
> Fix For: 0.3.0 (incubating)
>
> Attachments: LUCY-191-normalizer.patch
>
>
> As discussed on the mailing list, it would be nice to have Unicode
> normalization, Unicode case folding and stripping of accents as part of the
> analyzer chain. With the help of utf8proc this can be done in one pass. So I
> proposed a new analyzer Lucy::Analyzer::Normalizer with an interface
> described here:
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/%3C4EC43816.1070107%40aevum.de%3E