[jira] [Commented] (LUCENE-4072) CharFilter that Unicode-normalizes input

Robert Muir (JIRA) Tue, 26 Nov 2013 07:35:18 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832668#comment-13832668 ]


Robert Muir commented on LUCENE-4072:
-------------------------------------

Hi David: I haven't looked at the impact of the ICU bug (I'm not really 
that familiar with the incremental normalization API), but it seems rather 
serious.

Is it possible to avoid using hasBoundaryAfter? In addition to the bug you 
found, it carries a warning that it may be significantly slower than 
hasBoundaryBefore, so I'm wondering if we can dodge it.

> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
>                 Key: LUCENE-4072
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4072
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Ippei UKAI
>         Attachments: DebugCode.txt, LUCENE-4072.patch, LUCENE-4072.patch, 
> LUCENE-4072.patch, LUCENE-4072.patch, 
> ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer can work 
> on normalised text, while offset correction ensures that the fast vector 
> highlighter and other offset-dependent features do not break.
> The implementation is available at the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately this is my unpaid side project, and I cannot spend much time 
> merging my work into Lucene to make a proper patch. I'd appreciate it if 
> anyone could give it a go. I'm happy to relicense it under whatever terms 
> meet your needs.
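
To illustrate why offset correction matters in the description above, here is 
a rough sketch. It is not the code from the patch or the linked repository: 
it uses the JDK's java.text.Normalizer rather than ICU4J, the class and 
method names are made up for this example, and it normalizes one code point 
at a time, which assumes that adjacent code points do not interact under 
normalization (not true for combining sequences in general).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene or the attached patch.
class NormalizationOffsetDemo {

    /** Normalizes the input to NFKC with the JDK's built-in normalizer. */
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    /**
     * Naive offset map: element i is the index in the ORIGINAL string of the
     * code point that produced char i of the normalized output.
     * Simplifying assumption: each code point is normalized on its own, so
     * any interaction between neighbouring code points is ignored.
     */
    static int[] offsetMap(String input) {
        List<Integer> map = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String piece = normalize(new String(Character.toChars(cp)));
            for (int k = 0; k < piece.length(); k++) {
                map.add(i); // every output char points back to its source
            }
            i += Character.charCount(cp);
        }
        return map.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String original = "\uFB01ne"; // U+FB01 LATIN SMALL LIGATURE FI + "ne"
        String normalized = normalize(original);
        System.out.println(original.length() + " chars -> " + normalized.length() + " chars");
        int[] map = offsetMap(original);
        for (int i = 0; i < map.length; i++) {
            System.out.println("normalized[" + i + "] came from original[" + map[i] + "]");
        }
    }
}
```

The ligature expands from one char to two, so a token offset computed on the 
normalized text no longer lines up with the original text unless something 
like this map is consulted.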



