[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795043#action_12795043
 ] 

Simon Willnauer commented on LUCENE-2183:
-----------------------------------------

Hey guys, thanks for your comments.
When I started thinking about this issue I had a quick chat with Robert, and we 
figured that his solution could work, so I implemented it.
Yet, I found 2 problems with it.
1. If a user calls super.isTokenChar(char) and the superclass has implemented 
the int method, the UOE will never be thrown and the code does not behave as 
"expected" from the user's perspective. - This is what Robert explained above. We 
could solve this problem with reflection, which leads to the second problem.
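To make problem 1 concrete, here is a minimal sketch of a UOE-based fallback of the kind discussed above. All class and method names besides isTokenChar are illustrative, not the actual Lucene API, and the dispatch scheme is my reading of Robert's proposal:

```java
// Sketch (illustrative names) of an exception-based back-compat dispatch:
// the base class prefers the new int-based API and falls back to the old
// char-based API when the subclass has not implemented the new one.
public class UoeDispatchDemo {

  abstract static class Base {
    // New API: the default signals "not implemented here" so the base
    // class can fall back to the legacy char-by-char path.
    protected boolean isTokenChar(int codePoint) {
      throw new UnsupportedOperationException();
    }

    // Old API: legacy subclasses override this.
    protected boolean isTokenChar(char c) {
      throw new UnsupportedOperationException();
    }

    final boolean accept(int codePoint) {
      try {
        return isTokenChar(codePoint); // prefer the new int-based API
      } catch (UnsupportedOperationException e) {
        // legacy fallback; supplementary code points are lost here
        return codePoint <= Character.MAX_VALUE && isTokenChar((char) codePoint);
      }
    }
  }

  // A tokenizer that already implements the new int method.
  static class LetterBase extends Base {
    @Override protected boolean isTokenChar(int cp) {
      return Character.isLetter(cp);
    }
  }

  // Problem 1: the user's char override is silently ignored, because the
  // inherited int method never throws, so the fallback is never taken.
  static class UserSubclass extends LetterBase {
    @Override protected boolean isTokenChar(char c) {
      return c == '_' || Character.isLetter(c);
    }
  }

  public static void main(String[] args) {
    System.out.println(new LetterBase().accept('a'));   // true
    System.out.println(new UserSubclass().accept('_')); // false - user expected true
  }
}
```

The subclass compiles fine and looks correct, but its char-based override is dead code, which is exactly the "does not behave as expected" situation.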

2. A Tokenizer like LowerCaseTokenizer only overrides normalize(char|int) and 
relies on the superclass implementation of isTokenChar. Yet if we solve problem 
1 with reflection, the user would be forced to override isTokenChar just to call 
super.isTokenChar; otherwise the reflection code will either raise an exception 
that the int method is not implemented in the concrete class or fall back to the 
char API - either way it will not do what is expected. 
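A rough sketch of the kind of reflection probe this implies - again with illustrative names, not actual Lucene code - shows why inherited behavior confuses it:

```java
// Walk the class hierarchy and ask which class below the base actually
// declares a given isTokenChar overload. A class that (like
// LowerCaseTokenizer) overrides only normalize declares neither overload
// itself, so any per-class check needs extra rules to handle it.
public class ReflectionProbe {

  abstract static class Base {
    protected boolean isTokenChar(int cp) { return isTokenChar((char) cp); }
    protected boolean isTokenChar(char c) { return Character.isLetter(c); }
  }

  // Legacy style: overrides only the char variant.
  static class LegacyLetterTokenizer extends Base {
    @Override protected boolean isTokenChar(char c) {
      return Character.isLetter(c);
    }
  }

  // Overrides nothing itself (analogous to LowerCaseTokenizer overriding
  // only normalize): the probe finds no int method anywhere below Base.
  static class LowerCasing extends LegacyLetterTokenizer { }

  // Returns the class strictly below Base that declares the overload,
  // or null if no such class exists.
  static Class<?> declarer(Class<?> concrete, Class<?> paramType) {
    for (Class<?> c = concrete; c != Base.class; c = c.getSuperclass()) {
      try {
        c.getDeclaredMethod("isTokenChar", paramType);
        return c;
      } catch (NoSuchMethodException ignored) { }
    }
    return null;
  }

  public static void main(String[] args) {
    // the char variant is found in the legacy superclass...
    System.out.println(declarer(LowerCasing.class, char.class)
        == LegacyLetterTokenizer.class); // true
    // ...but no class in the chain declares the int variant
    System.out.println(declarer(LowerCasing.class, int.class)); // null
  }
}
```

So the probe either has to reject such classes with an exception or silently pick the char path for them - neither of which matches user expectations.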

Working around those two problems was the reason for a new API for CharTokenizer. 
My personal opinion is that inheritance is the wrong tool for changing behavior, 
so I used delegation (like a strategy): on the one hand it defines a clear "new" 
API, and on the other it decouples the code changing the behavior of the 
Tokenizer from the tokenizer itself. To me, inheritance is for extending a class 
and delegation is for changing behavior in this particular problem. 
Decoupling the old from the new has several advantages over a reflection / 
inheritance based solution:
1. if a user does not provide a delegate impl, he wants to use the old API
2. if a user does provide a delegate impl, he still has the ability to choose 
between char processing in 3.0 style or 3.1 style
3. no matter what is provided, the user has full flexibility to choose the 
combination of their choice - old char processing with the new int-based API 
(maybe minor, though)
4. we can leave all tokenizer subclasses as they are and define new functions 
that implement their behavior in parallel. Those functions can be made final 
from the beginning, which prevents users from subclassing them. (All of the 
existing ones should be final in my opinion - like LowerCaseTokenizer, which 
should call Character.isLetter in isTokenCodePoint(int) directly instead of 
subclassing another function.)
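The delegation design above can be sketched roughly like this - the interface and class names are hypothetical, not the names in the attached patch:

```java
// Hedged sketch of the delegation ("strategy") design argued for above.
public class DelegationSketch {

  // The "new" int-based API lives in a delegate, decoupled from the tokenizer.
  interface CharProcessor {
    boolean isTokenCodePoint(int codePoint);
    int normalizeCodePoint(int codePoint);
  }

  // Point 4: behaviors become final, non-subclassable implementations that
  // call Character.isLetter / Character.toLowerCase directly.
  static final class LetterLowerCaseProcessor implements CharProcessor {
    public boolean isTokenCodePoint(int cp) { return Character.isLetter(cp); }
    public int normalizeCodePoint(int cp) { return Character.toLowerCase(cp); }
  }

  static class TokenizerSketch {
    private final CharProcessor processor; // null => point 1: old API

    TokenizerSketch(CharProcessor processor) { this.processor = processor; }

    boolean accept(int cp) {
      if (processor != null) {
        return processor.isTokenCodePoint(cp); // 3.1-style int processing
      }
      // legacy 3.0-style char processing, sketched: supplementary code
      // points (above the BMP) can never be token chars here
      return cp <= Character.MAX_VALUE && Character.isLetter((char) cp);
    }
  }

  public static void main(String[] args) {
    int deseret = 0x10400; // a supplementary letter (Deseret block)
    System.out.println(new TokenizerSketch(null).accept(deseret));  // false
    System.out.println(
        new TokenizerSketch(new LetterLowerCaseProcessor()).accept(deseret)); // true
  }
}
```

With no delegate the tokenizer behaves exactly like today; with one, supplementary code points such as U+10400 are handled correctly, and the choice is explicit in user code rather than inferred by reflection.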

As a user I would expect Lucene to revise design decisions made years ago 
when there is a need for it, like we have in this issue. It is easier to change 
behavior in user code by switching to a new API than by digging into a 
workaround implementation of an old API that silently calls a new one.



> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a 
> character level. Yet, those tokenizers still use char primitives instead of 
> int codepoints. CharTokenizer should operate on codepoints and preserve bw 
> compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
