[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719127#action_12719127
 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Simon, I want to address your comment on performance.
I think surrogate detection is cheap when done right, and I don't think 
there are many places that need changes.
But I don't think any indicator is really appropriate. For example, my 
TokenFilter might want to convert one Chinese character in the BMP to another 
one outside of the BMP. It is all Unicode.
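
A minimal sketch of that point, assuming plain Java 5 APIs only (this is not Lucene code; the class name, method name, and the two code points are hypothetical). It walks a string with a single surrogate check per char, and shows that input without any surrogates can still produce output that contains them, which is why an up-front indicator cannot be trusted:

```java
final class SurrogateSketch {
    // Hypothetical mapping, chosen only for illustration.
    static final int FROM = 0x4E00;  // a CJK ideograph inside the BMP
    static final int TO = 0x20000;   // a CJK ideograph outside the BMP

    // Replaces every FROM code point with TO, decoding surrogate pairs on the way.
    static String mapCodePoints(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int cp;
            if (Character.isHighSurrogate(c)
                    && i + 1 < in.length()
                    && Character.isLowSurrogate(in.charAt(i + 1))) {
                // Rare path: a valid surrogate pair forms one supplementary code point.
                cp = Character.toCodePoint(c, in.charAt(i + 1));
                i += 2;
            } else {
                // Common path: a BMP char (or an unpaired surrogate, passed through).
                cp = c;
                i++;
            }
            out.appendCodePoint(cp == FROM ? TO : cp);
        }
        return out.toString();
    }
}
```

Calling `SurrogateSketch.mapCodePoints("a\u4E00b")` takes a three-char input with no surrogates and returns a four-char string whose middle code point is a surrogate pair.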

But there is more than just analysis involved here. For example, I have not 
tested WildcardQuery's ? operator.
I'm not trying to go berzerko and be 'ultra-correct', but basic things like 
that should work.
For situations where it's not worth it, e.g. FuzzyQuery's scoring, we should 
just document that the calculation is based on 'code units', and leave it alone.
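
To illustrate the difference, here is a small sketch, again assuming only standard Java 5 APIs. The class and the matcher are hypothetical and are not how WildcardQuery is implemented; the matcher supports only '?' and treats it as exactly one code point rather than one char:

```java
final class CodeUnitVsCodePoint {
    // Length as char-based code sees it.
    static int codeUnits(String s) {
        return s.length();
    }

    // Length as a user sees it: one per Unicode code point.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical matcher: '?' matches exactly one code point; everything else is literal.
    static boolean matchesQuestionMarks(String pattern, String text) {
        int p = 0;
        int t = 0;
        while (p < pattern.length() && t < text.length()) {
            int pc = pattern.codePointAt(p);
            int tc = text.codePointAt(t);
            if (pc != '?' && pc != tc) {
                return false;
            }
            // Advance by one or two chars depending on the code point.
            p += Character.charCount(pc);
            t += Character.charCount(tc);
        }
        return p == pattern.length() && t == text.length();
    }
}
```

For a string made of 'a', U+20000, and 'b', `codeUnits` reports 4 while `codePoints` reports 3, and the pattern `a?b` matches only under the code point interpretation.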


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> This is for Java 5. Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with 
> char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be 
> changed so they don't actually remove supplementary characters, or modified 
> to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase supplementary characters 
> correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> In all of these cases code should remain optimized for the BMP case, and 
> supplementary characters should be the exception, but still work.
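
A sketch of the "optimized for the BMP, but still correct" idea from the description above, applied to in-place lowercasing of a char[] buffer. This is an illustration using standard Java 5 methods, not the actual filter code; the class and method names are hypothetical, and it assumes the lowercase form of a supplementary character is itself supplementary, so it fits in the same two chars:

```java
final class LowercaseSketch {
    // Lowercases buf[0..len) in place, handling surrogate pairs as single code points.
    static void lowerCaseInPlace(char[] buf, int len) {
        int i = 0;
        while (i < len) {
            char c = buf[i];
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < len
                    && Character.isLowSurrogate(buf[i + 1]);
            if (!pair) {
                // Fast path: BMP char. Unpaired surrogates come back unchanged.
                buf[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: decode the pair, lowercase the code point, write it back.
                int lower = Character.toLowerCase(Character.toCodePoint(c, buf[i + 1]));
                if (Character.isSupplementaryCodePoint(lower)) {
                    Character.toChars(lower, buf, i);
                }
                i += 2;
            }
        }
    }
}
```

With a buffer holding 'A', U+10400 (DESERET CAPITAL LETTER LONG I), and 'Z', the result holds 'a', U+10428, and 'z', and the only extra cost for pure-BMP text is one surrogate check per char.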

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


