[lucy-issues] [jira] [Commented] (LUCY-191) Unicode normalization

Marvin Humphrey (Commented) (JIRA) Mon, 21 Nov 2011 14:46:05 -0800

    [ https://issues.apache.org/jira/browse/LUCY-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154693#comment-13154693 ]


Marvin Humphrey commented on LUCY-191:
--------------------------------------

Great job, Nick!  You really nailed it with this contribution.

  * Public API and rough implementation design match what you outlined and
    built consensus for on the dev list.
  * Proper documentation.
  * Builds clean.
  * Accompanied by tests and passes them.
  * Passes test_valgrind.
  * Looks portable.

I have a handful of minor suggestions, but as none of them are crucial, +1 to
commit verbatim.

Here are two things that I'd like to discuss on the dev list:

  * Memory allocated with malloc() inside utf8proc should not necessarily be
    freed with FREEMEM (which is an alias for lucy_Memory_wrapped_free).
    This happens to be safe right now, but only because of an implementation
    detail of Lucy::Util::Memory.
  * The fact that utf8proc forces us to reallocate on every operation, rather
    than copying when possible as SnowballStemmer does, is probably not
    optimal from a performance standpoint.
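
To make the two points above concrete, here is a minimal sketch in C. It is
not the patch and not the utf8proc API: third_party_transform() is a
hypothetical stand-in for any library routine that hands back a malloc()'d
buffer, and OwnedBuf, host_realloc() and host_free() are placeholder names
for a buffer managed by the host project's own allocator wrappers. The idea
is to release the library's buffer with free() and keep the result in a
buffer that only grows when it has to.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a third-party routine that returns a buffer
 * allocated with plain malloc().  It is NOT the utf8proc API; it merely
 * lowercases ASCII so that the example stays self-contained. */
static char *
third_party_transform(const char *src, size_t len) {
    char *out = (char*)malloc(len + 1);
    if (out == NULL) { return NULL; }
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    out[len] = '\0';
    return out;
}

/* Hypothetical reusable buffer owned by the host project's allocator. */
typedef struct {
    char   *ptr;
    size_t  size;  /* bytes in use, excluding the terminating NUL */
    size_t  cap;   /* bytes allocated */
} OwnedBuf;

/* Placeholders for the host project's allocator wrappers. */
static void *host_realloc(void *p, size_t n) { return realloc(p, n); }
static void  host_free(void *p)              { free(p); }

/* Transform `src`, copy the result into `buf` (growing only when needed),
 * and release the third-party buffer with free().
 * Returns 0 on success, -1 on allocation failure. */
static int
transform_into(OwnedBuf *buf, const char *src, size_t len) {
    assert(buf != NULL && src != NULL);
    char *tmp = third_party_transform(src, len);
    if (tmp == NULL) { return -1; }

    size_t need = strlen(tmp) + 1;
    if (need > buf->cap) {
        char *grown = (char*)host_realloc(buf->ptr, need);
        if (grown == NULL) { free(tmp); return -1; }
        buf->ptr = grown;
        buf->cap = need;
    }
    memcpy(buf->ptr, tmp, need);
    buf->size = need - 1;

    /* malloc()'d by the third party, so pair it with free(),
     * not with the host's wrapper. */
    free(tmp);
    return 0;
}

/* Release the host-owned buffer through the host's own wrapper. */
static void
OwnedBuf_destroy(OwnedBuf *buf) {
    host_free(buf->ptr);
    buf->ptr  = NULL;
    buf->size = 0;
    buf->cap  = 0;
}
```

Note that this trades the allocator mismatch for an extra copy, and the
library still allocates on every call, so it speaks to the safety concern
more than the performance one.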

Here are two tiny details:

  * We can simplify the dump/load routines if we cache "form" within the
    object as a member variable.  
  * The keywords "true" and "false" are available in Clownfish, and I think we
    should use those as the defaults in the method signature for the boolean
    args.
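
As a rough illustration of the first detail, here is a sketch in C of why
caching "form" helps. Everything in it is an assumption made for the
example: the struct layout, the OPT_* flag values (not the real utf8proc
constants), the argument names case_fold and strip_accents, the accepted
form names, and the flat text dump format. The point is only that dump can
write the stored form back out directly instead of reconstructing it from
derived option bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits; not the real utf8proc option values. */
enum { OPT_COMPOSE = 1, OPT_COMPAT = 2, OPT_CASEFOLD = 4, OPT_STRIP = 8 };

typedef struct {
    char form[8];        /* cached constructor argument, e.g. "NFKC" */
    bool case_fold;
    bool strip_accents;
    int  options;        /* derived flags used at analysis time */
} Normalizer;

/* Initialize from the user-facing arguments.  Returns 0, or -1 if the
 * form name is not recognized. */
static int
Normalizer_init(Normalizer *self, const char *form,
                bool case_fold, bool strip_accents) {
    int options;
    if      (strcmp(form, "NFC")  == 0) { options = OPT_COMPOSE; }
    else if (strcmp(form, "NFKC") == 0) { options = OPT_COMPOSE | OPT_COMPAT; }
    else if (strcmp(form, "NFD")  == 0) { options = 0; }
    else if (strcmp(form, "NFKD") == 0) { options = OPT_COMPAT; }
    else { return -1; }
    if (case_fold)     { options |= OPT_CASEFOLD; }
    if (strip_accents) { options |= OPT_STRIP; }

    strcpy(self->form, form);   /* safe: every accepted name fits in 8 bytes */
    self->case_fold     = case_fold;
    self->strip_accents = strip_accents;
    self->options       = options;
    return 0;
}

/* Dump: the cached members are written out as-is; there is no need to
 * reverse-map the option bits back to a form name. */
static int
Normalizer_dump(const Normalizer *self, char *out, size_t out_size) {
    return snprintf(out, out_size, "form=%s case_fold=%d strip_accents=%d",
                    self->form, (int)self->case_fold,
                    (int)self->strip_accents);
}

/* Load: parse the same three fields and reuse the constructor logic. */
static int
Normalizer_load(Normalizer *self, const char *dump) {
    char form[8];
    int  case_fold, strip_accents;
    if (sscanf(dump, "form=%7s case_fold=%d strip_accents=%d",
               form, &case_fold, &strip_accents) != 3) {
        return -1;
    }
    return Normalizer_init(self, form, case_fold != 0, strip_accents != 0);
}
```

With this layout, a dump followed by a load yields an object with the same
form, booleans and derived options as the original.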

That's all I got right now!  Nice work figuring out this patch with minimal
help.
                
> Unicode normalization
> ---------------------
>
>                 Key: LUCY-191
>                 URL: https://issues.apache.org/jira/browse/LUCY-191
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Nick Wellnhofer
>            Assignee: Marvin Humphrey
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.3.0 (incubating)
>
>         Attachments: LUCY-191-normalizer.patch
>
>
> As discussed on the mailing list, it would be nice to have Unicode 
> normalization, Unicode case folding and stripping of accents as part of the 
> analyzer chain. With the help of utf8proc this can be done in one pass. So I 
> proposed a new analyzer Lucy::Analyzer::Normalizer with an interface 
> described here:
> http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/%3C4EC43816.1070107%40aevum.de%3E
