[ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719125#action_12719125 ]
Simon Willnauer commented on LUCENE-1689:
-----------------------------------------

The scary thing is that this already happens if you run Lucene on a 1.5 VM, even without introducing 1.5 code.
I think we need to act on this issue ASAP and release the fix together with 3.0.
-> full support for Unicode 4.0 in Lucene 3.0

I also thought about the implementation a little bit. The need to detect chars > BMP and operate on them might be spread out across Lucene (quite a few analyzers, filters, etc.). Performance could truly suffer if this is done "wrong", or done more than once per token.
It might be worth considering making the detection pluggable, with an initial filter that only checks whether surrogates are present in a token and sets an indicator on the token representation, so that subsequent TokenStreams can operate on it without rechecking. This would also preserve performance for those who do not need chars > BMP (which could be quite a large number of people). They could then simply not supply such an initial filter.

Just a couple of random thoughts.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be
> changed so they don't actually remove suppl characters, or modified to look
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar()
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and
> suppl characters should be the exception, but still work.
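
To make the "detect once, reuse the indicator" idea from the comment a bit more concrete, here is a rough sketch in plain Java. It is only an illustration under stated assumptions, not a proposed patch: SketchToken, markSurrogates and lowerCaseInPlace are hypothetical names and are not part of Lucene's API. Only the java.lang.Character calls are real, and they have been available since Java 5. The lowercasing step assumes that a simple case mapping never changes how many chars a code point occupies.

```java
final class SupplementarySketch {

    /** Hypothetical token holder for illustration; this is NOT Lucene's Token class. */
    static final class SketchToken {
        final char[] buffer;
        final int length;
        boolean hasSurrogates; // the "indicator", set once by the initial filter

        SketchToken(String text) {
            buffer = text.toCharArray();
            length = buffer.length;
        }

        @Override
        public String toString() {
            return new String(buffer, 0, length);
        }
    }

    /** Returns true if any UTF-16 surrogate code unit occurs in buf[off, off + len). */
    static boolean containsSurrogates(char[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            char c = buf[i];
            if (c >= Character.MIN_SURROGATE && c <= Character.MAX_SURROGATE) {
                return true;
            }
        }
        return false;
    }

    /** The optional "initial filter" step: scan the token once and record the result. */
    static void markSurrogates(SketchToken t) {
        t.hasSurrogates = containsSurrogates(t.buffer, 0, t.length);
    }

    /**
     * Lowercases in place. Without the indicator, the cheap char-by-char loop is used
     * (the BMP fast path). With it, the buffer is walked by code point.
     * Assumption: the lowercased code point needs as many chars as the original.
     */
    static void lowerCaseInPlace(char[] buf, int off, int len, boolean hasSurrogates) {
        int end = off + len;
        if (!hasSurrogates) {
            for (int i = off; i < end; i++) {
                buf[i] = Character.toLowerCase(buf[i]);
            }
            return;
        }
        for (int i = off; i < end; ) {
            int cp = Character.codePointAt(buf, i, end);
            // toChars writes the code point at index i and returns 1 or 2
            i += Character.toChars(Character.toLowerCase(cp), buf, i);
        }
    }

    public static void main(String[] args) {
        // "A", U+10400 (DESERET CAPITAL LETTER LONG I, a supplementary character), "B"
        SketchToken t = new SketchToken("A\uD801\uDC00B");
        markSurrogates(t);
        lowerCaseInPlace(t.buffer, 0, t.length, t.hasSurrogates);
        System.out.println(t.hasSurrogates + " " + t.toString().codePointCount(0, t.length));
    }
}
```

Users who never index chars > BMP would skip markSurrogates entirely and stay on the char-by-char path. The price is that any supplementary characters that do show up are passed through unchanged instead of being lowercased.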