[jira] Commented: (LUCENE-1689) supplementary character handling

Simon Willnauer (JIRA) Sat, 13 Jun 2009 07:23:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719125#action_12719125
 ]


Simon Willnauer commented on LUCENE-1689:
-----------------------------------------

The scary thing is that this already happens if you run Lucene on a 1.5 VM, 
even without introducing any 1.5 code. 
I think we need to act on this issue ASAP and release the fix together with 
3.0, i.e. full support for Unicode 4.0 in Lucene 3.0. 
I also thought about the implementation a little. The need to detect chars 
outside the BMP and operate on them might be spread across Lucene (quite a few 
analyzers, filters, etc.). Performance could truly suffer if this is done 
"wrong" or more than once. It might be worth considering making the detection 
pluggable, with an initial filter that only checks whether surrogates are 
present in a token and sets an indicator on the token representation so that 
subsequent TokenStreams can operate on it without rechecking. This would also 
preserve performance for those who do not need chars outside the BMP (which 
could be quite a large number of people). They could then simply not supply 
such an initial filter.
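
To make the idea a bit more concrete, here is a minimal sketch of the one-time 
surrogate check in plain Java 5. It does not use any Lucene API; the class 
names SurrogateScan and MarkedToken are hypothetical and only stand in for 
whatever indicator the token representation would actually carry.

```java
/** Illustrative only: not Lucene code, all names here are assumptions. */
public final class SurrogateScan {
    private SurrogateScan() {}

    /**
     * Returns true if buffer[offset, offset + length) contains at least one
     * UTF-16 surrogate code unit (high or low), i.e. the token may contain
     * characters outside the BMP.
     */
    public static boolean containsSurrogates(char[] buffer, int offset, int length) {
        final int end = offset + length;
        for (int i = offset; i < end; i++) {
            char c = buffer[i];
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Hypothetical token holder: the check runs once at construction, and
     * later stages read the flag instead of rescanning the term.
     */
    public static final class MarkedToken {
        public final char[] term;
        public final boolean hasSurrogates;

        public MarkedToken(char[] term) {
            this.term = term;
            this.hasSurrogates = containsSurrogates(term, 0, term.length);
        }
    }
}
```

A downstream stage would then take the plain char-based path whenever 
hasSurrogates is false and only fall back to code point handling otherwise.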

Just a couple of random thoughts.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.
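
As an illustration of the int-based handling the description asks for, the 
following sketch lowercases a char[] by code point using only java.lang.Character 
methods available since Java 5. It is not the attached example or any Lucene 
class; the name CodePointLowerCase is hypothetical, and the sketch assumes the 
lowercase form occupies the same number of chars as the original (otherwise 
the character is left unchanged).

```java
/** Illustrative only: not Lucene code, the class name is an assumption. */
public final class CodePointLowerCase {
    private CodePointLowerCase() {}

    /** Lowercases buffer[offset, offset + length) in place, one code point at a time. */
    public static void toLowerCase(char[] buffer, int offset, int length) {
        final int end = offset + length;
        int i = offset;
        while (i < end) {
            // Reads one char for BMP characters, two for a valid surrogate pair.
            int cp = Character.codePointAt(buffer, i, end);
            int width = Character.charCount(cp);
            int lower = Character.toLowerCase(cp);
            // Write back only when the result fits in the same number of chars.
            if (lower != cp && Character.charCount(lower) == width) {
                Character.toChars(lower, buffer, i);
            }
            i += width;
        }
    }
}
```

For BMP-only input the loop advances one char per step, so the common case 
pays for little more than the codePointAt call.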

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
