[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

Robert Muir (JIRA) Tue, 29 Dec 2009 05:56:56 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795052#action_12795052
 ]


Robert Muir commented on LUCENE-2183:
-------------------------------------

{quote}
1. If a user calls super.isTokenChar(char) and the super class has implemented 
the int method the UOE will never be thrown and the code does not behave like 
"expected" from the user perspective. - This is what robert explained above. We 
could solve this problem with reflection which leads to the second problem.

2. If a Tokenizer like LowerCaseTokenizer only overrides normalize(char|int) it 
relies on the superclass implementation of isTokenChar. Yet if we solve problem 
1. the user would be forced to override the isTokenChar to just call 
super.isTokenChar otherwise the reflection code will raise an exception that 
the int method is not implemented in the concrete class or will use the char 
API - anyway it will not do what is expected. 
{quote}

I do not think this is true. What I was trying to do was modify the design I 
proposed so that we did not need reflection at all, but I think this is 
impossible. 

In the design you propose, subclassing is impossible under the new API, which I 
am not sure I like either.

#2 is no problem at all. Instead, the reflection code that addresses #1 must be 
implemented with these conditions:

* A is the class implementing method isTokenChar(int)
* B is the class implementing method isTokenChar(char)
* B is a subclass of A
* A is not CharTokenizer
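
A minimal sketch of one way to read these conditions as a reflection check. 
This is not the code in the attached patch: BaseTokenizer is a hypothetical 
stand-in for CharTokenizer, and TokenCharCheck, declaringClass, mixesApis and 
the example subclasses are names made up for illustration. It assumes the check 
should flag a class only when all four conditions hold.

```java
class TokenCharCheck {

    // Hypothetical stand-in for CharTokenizer (NOT the real Lucene class).
    static abstract class BaseTokenizer {
        protected boolean isTokenChar(char c) {
            throw new UnsupportedOperationException("char API not implemented");
        }
        protected boolean isTokenChar(int codePoint) {
            throw new UnsupportedOperationException("int API not implemented");
        }
    }

    // Implements only the new int (code point) method.
    static class NewStyle extends BaseTokenizer {
        @Override protected boolean isTokenChar(int codePoint) {
            return Character.isLetter(codePoint);
        }
    }

    // Overrides neither method, like a tokenizer that only changes normalize().
    static class NormalizeOnly extends NewStyle {
    }

    // Overrides the old char method on top of a new-style superclass.
    static class OldStyleSub extends NewStyle {
        @Override protected boolean isTokenChar(char c) {
            return c != '_';
        }
    }

    // Implements only the old char method directly on the base class.
    static class OldStyleOnly extends BaseTokenizer {
        @Override protected boolean isTokenChar(char c) {
            return Character.isLetter(c);
        }
    }

    // Walks up from the concrete class and returns the first class that
    // declares isTokenChar with the given parameter type.
    static Class<?> declaringClass(Class<?> concrete, Class<?> paramType) {
        for (Class<?> c = concrete; c != null; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("isTokenChar", paramType);
                return c;
            } catch (NoSuchMethodException e) {
                // Not declared at this level; keep walking up.
            }
        }
        throw new IllegalArgumentException("no isTokenChar(" + paramType + ")");
    }

    // True only when all four conditions from the list hold:
    // A declares isTokenChar(int), B declares isTokenChar(char),
    // B is a (strict) subclass of A, and A is not the base class.
    static boolean mixesApis(Class<? extends BaseTokenizer> concrete) {
        Class<?> a = declaringClass(concrete, int.class);
        Class<?> b = declaringClass(concrete, char.class);
        boolean bIsSubclassOfA = a != b && a.isAssignableFrom(b);
        return bIsSubclassOfA && a != BaseTokenizer.class;
    }

    public static void main(String[] args) {
        System.out.println("OldStyleSub:   " + mixesApis(OldStyleSub.class));
        System.out.println("NormalizeOnly: " + mixesApis(NormalizeOnly.class));
        System.out.println("OldStyleOnly:  " + mixesApis(OldStyleOnly.class));
    }
}
```

Under this reading, a class like NormalizeOnly is not flagged, because the 
class declaring isTokenChar(char) is the base class itself and so is not a 
subclass of A. That would be why #2 needs no extra override.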



> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a 
> character level. Yet, those tokenizers still use char primitives instead of 
> int codepoints. CharTokenizer should operate on codepoints and preserve bw 
> compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


