[
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740866#action_12740866
]
Michael McCandless commented on LUCENE-1689:
--------------------------------------------
{quote}
I think instead of the way I prototyped in the first patch, it might be better
to keep the original CharTokenizer incrementToken definition available in
the code.
This is some temporary code duplication, but it would perform better for the
backwards-compat case, and the backwards compatibility would be clearer, to
me at least.
{quote}
I suppose we could simply make an entirely new class that properly handles
surrogates, and deprecate CharTokenizer in favor of it? Likewise, we'd have to
make new classes for the current subclasses of CharTokenizer
(WhitespaceTokenizer, LetterTokenizer). That would simplify staying backward
compatible.
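
As a rough illustration of what such a replacement class might look like, here
is a minimal sketch of a tokenizer loop that walks code points rather than
chars. It is not Lucene code and not the patch: the class name
CodePointTokenizerSketch, the tokenize() method, and the int-based
isTokenChar()/normalize() signatures are all assumptions made for this example.
It only shows that a surrogate pair can be treated as one unit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not extend any Lucene class.
class CodePointTokenizerSketch {

    // Assumed int-based counterpart of isTokenChar(char): letters are token chars.
    protected boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Assumed int-based counterpart of normalize(char): identity by default.
    protected int normalize(int codePoint) {
        return codePoint;
    }

    // Splits text into tokens, reading one full code point per step.
    List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a valid surrogate pair into a single int.
            int cp = Character.codePointAt(text, i);
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, a supplementary letter such as U+10400 stays inside its token
instead of being split or dropped, because the test is applied to the whole
code point and never to a lone surrogate.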
> supplementary character handling
> --------------------------------
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt,
> testCurrentBehavior.txt
>
>
> For Java 5: Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with
> char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be
> changed so they don't actually remove supplementary characters, or modified
> to look for surrogates and behave correctly.
> LowerCaseFilter should be modified to lowercase supplementary characters
> correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar()
> and normalize() use int.
> In all of these cases code should remain optimized for the BMP case, and
> supplementary characters should be the exception, but still work.
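
To make the "optimized for the BMP case" point in the description concrete,
here is a small sketch of in-place lowercasing over a char[] that takes a
per-char fast path and only decodes a code point when it sees a valid surrogate
pair. The class and method names are hypothetical and this is not the actual
LowerCaseFilter code. It also assumes that the lowercase form of a
supplementary code point is itself supplementary, so the buffer length does not
change.

```java
// Hypothetical helper, for illustration only.
final class SupplementaryLowerCaseSketch {

    // Lowercases buffer[0..length) in place.
    static void toLowerCaseInPlace(char[] buffer, int length) {
        int i = 0;
        while (i < length) {
            char c = buffer[i];
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < length
                    && Character.isLowSurrogate(buffer[i + 1]);
            if (!pair) {
                // Fast path: BMP char (or an unpaired surrogate, left unchanged).
                buffer[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: decode the pair, lowercase the code point, re-encode.
                int cp = Character.toCodePoint(c, buffer[i + 1]);
                // toChars returns the number of chars written (2 here, by assumption).
                i += Character.toChars(Character.toLowerCase(cp), buffer, i);
            }
        }
    }
}
```

For example, the buffer for "A" followed by U+10400 (DESERET CAPITAL LETTER
LONG I) becomes "a" followed by U+10428, while plain BMP text never leaves the
fast path.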