[
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794964#action_12794964
]
Robert Muir commented on LUCENE-2183:
-------------------------------------
I thought about this some, but I am worried about one thing:
Consider LetterTokenizer, which is a non-final subclass of CharTokenizer.
Let's say you make a LetterAndNumberTokenizer which extends LetterTokenizer, but
you do not implement the int-based method.
{code}
@Override
public boolean isTokenChar(char c) {
  // only the char-based method is overridden; isTokenChar(int) is inherited
  return super.isTokenChar(c) || Character.isDigit(c);
}
{code}
We have fixed LetterTokenizer so that it has isTokenChar(int), but that means
that if someone tries to use this LetterAndNumberTokenizer with
Version.LUCENE_31, it will not work: instead of throwing UOE, it will silently
discard numbers, since the int-based method that gets called is LetterTokenizer's.
Of course it will work correctly with Version.LUCENE_30, so it is not a back
compat problem, but with LUCENE_31 it will not throw UOE and will silently
behave incorrectly until the 'int' method is implemented.
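To make the failure mode concrete, here is a minimal sketch of the dispatch. The Sketch* classes are hypothetical stand-ins, not the real Lucene classes, and the boolean flag is an assumed simplification of the Version check.

```java
// Hypothetical stand-in for CharTokenizer; NOT the real Lucene class.
abstract class SketchCharTokenizer {
    // Assumption: true plays the role of Version.LUCENE_31, false of LUCENE_30.
    private final boolean useCodePoints;

    SketchCharTokenizer(boolean useCodePoints) {
        this.useCodePoints = useCodePoints;
    }

    // Old char-based hook.
    protected boolean isTokenChar(char c) {
        throw new UnsupportedOperationException("isTokenChar(char) not implemented");
    }

    // New int-based (code point) hook.
    protected boolean isTokenChar(int c) {
        throw new UnsupportedOperationException("isTokenChar(int) not implemented");
    }

    // Returns the characters of the input that are accepted as token characters.
    String tokenChars(String input) {
        StringBuilder sb = new StringBuilder();
        if (useCodePoints) {
            for (int i = 0; i < input.length(); ) {
                int cp = input.codePointAt(i);
                if (isTokenChar(cp)) {
                    sb.appendCodePoint(cp);
                }
                i += Character.charCount(cp);
            }
        } else {
            for (char c : input.toCharArray()) {
                if (isTokenChar(c)) {
                    sb.append(c);
                }
            }
        }
        return sb.toString();
    }
}

// Already "fixed": implements both the char-based and the int-based method.
class SketchLetterTokenizer extends SketchCharTokenizer {
    SketchLetterTokenizer(boolean useCodePoints) {
        super(useCodePoints);
    }

    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }
}

// User subclass that overrides only the char-based method.
class SketchLetterAndNumberTokenizer extends SketchLetterTokenizer {
    SketchLetterAndNumberTokenizer(boolean useCodePoints) {
        super(useCodePoints);
    }

    @Override
    protected boolean isTokenChar(char c) {
        return super.isTokenChar(c) || Character.isDigit(c);
    }
}
```

With this sketch, new SketchLetterAndNumberTokenizer(false).tokenChars("ab12") keeps the digits, while the same call with true returns only "ab" and never throws UOE, because the inherited int-based method answers for the subclass.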
So I think this is a problem in this design, and I do not know how to fix it
without reflection.
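For reference, a reflection-based check might look roughly like the sketch below. It is only an illustration of the idea, not something from the patch: OverrideCheck and the *Tok classes are hypothetical names, and the rule it applies (the char-based method is declared lower in the hierarchy than the int-based one) is an assumption about what should count as an incomplete subclass.

```java
// Hypothetical helper; not part of Lucene or of the attached patch.
final class OverrideCheck {
    private OverrideCheck() {}

    // Walks up from 'start' and returns the first class that declares
    // name(paramType), or null if no class in the hierarchy declares it.
    static Class<?> declaringClass(Class<?> start, String name, Class<?> paramType) {
        for (Class<?> c = start; c != null; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod(name, paramType);
                return c;
            } catch (NoSuchMethodException e) {
                // not declared here, keep walking up
            }
        }
        return null;
    }

    // True when isTokenChar(char) is overridden in a strict subclass of the
    // class that provides isTokenChar(int), i.e. the int-based method would
    // silently ignore the more specific char-based logic.
    static boolean charOverrideShadowsInt(Class<?> tokenizerClass) {
        Class<?> charOwner = declaringClass(tokenizerClass, "isTokenChar", char.class);
        Class<?> intOwner = declaringClass(tokenizerClass, "isTokenChar", int.class);
        if (charOwner == null) {
            return false;
        }
        if (intOwner == null) {
            return true;
        }
        return charOwner != intOwner && intOwner.isAssignableFrom(charOwner);
    }
}

// Minimal hypothetical hierarchy used to exercise the check.
class BaseTok {
    protected boolean isTokenChar(char c) { return false; }
    protected boolean isTokenChar(int c) { return false; }
}

class LetterTok extends BaseTok {
    @Override
    protected boolean isTokenChar(char c) { return Character.isLetter(c); }

    @Override
    protected boolean isTokenChar(int c) { return Character.isLetter(c); }
}

class LetterNumTok extends LetterTok {
    // Only the char-based method is overridden here.
    @Override
    protected boolean isTokenChar(char c) {
        return super.isTokenChar(c) || Character.isDigit(c);
    }
}
```

A base class could run such a check in its constructor and throw UOE when it returns true for getClass() and the code point behavior is requested. Here it returns true for LetterNumTok.class and false for LetterTok.class.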
> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
> Key: LUCENE-2183
> URL: https://issues.apache.org/jira/browse/LUCENE-2183
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a
> character level. Yet, those tokenizers still use char primitives instead of
> int codepoints. CharTokenizer should operate on codepoints and preserve bw
> compatibility.