[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795043#action_12795043 ]
Simon Willnauer commented on LUCENE-2183:
-----------------------------------------

Hey guys, thanks for your comments.

When I started thinking about this issue I had a quick chat with Robert, and we figured that his solution could work, so I implemented it. Yet I found two problems with it.

  1. If a user calls super.isTokenChar(char) and the superclass has implemented the int method, the UOE (UnsupportedOperationException) will never be thrown and the code does not behave as "expected" from the user's perspective. This is what Robert explained above. We could solve this problem with reflection, which leads to the second problem.
  2. If a Tokenizer like LowerCaseTokenizer only overrides normalize(char|int), it relies on the superclass implementation of isTokenChar. Yet if we solve problem 1, the user would be forced to override isTokenChar just to call super.isTokenChar. Otherwise the reflection code will either raise an exception saying that the int method is not implemented in the concrete class, or it will use the char API. Either way it will not do what is expected.

Working around those two problems was the reason for a new API for CharTokenizer. My personal opinion is that inheritance is the wrong tool for changing behavior. I used delegation (like a strategy) to define a clear "new" API on the one hand and, on the other, to decouple the code that changes the behavior of the Tokenizer from the tokenizer itself. For me, inheritance is for extending a class, and delegation is for changing behavior in this particular problem.

Decoupling the old from the new has several advantages over a reflection / inheritance based solution.

  1. If a user does not provide a delegation impl, he wants to use the old API.
  2. If a user does provide a delegation impl, he still has the ability to choose between char processing in 3.0 style or 3.1 style.
  3. No matter what is provided, a user has full flexibility to choose the combination of their choice (old char processing, new int-based API), though this is maybe minor.
  4. We can leave all tokenizer subclasses as they are and define new functions that implement their behavior in parallel. Those functions can be made final from the beginning, which prevents users from subclassing them. (All of the existing ones should be final in my opinion, like LowerCaseTokenizer, which should call Character.isLetter in isTokenCodePoint(int) directly instead of subclassing another function.)

As a user I would expect Lucene to revise design decisions made years ago when there is a need for it, as there is in this issue. It is easier to change behavior in user code by swapping to a new API than by digging into a workaround implementation of an old API that silently calls a new API.

> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a
> character level. Yet, those tokenizers still use char primitives instead of
> int codepoints. CharTokenizer should operate on codepoints and preserve
> backwards compatibility.
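
To illustrate the delegation idea from the comment above, here is a minimal, self-contained sketch. It is not the code from LUCENE-2183.patch and does not use any Lucene classes. The names `DelegationSketch`, `CodePointFunction`, `LowerCaseLetterFunction` and `tokenize` are hypothetical; only `isTokenCodePoint(int)` and `normalize` are names taken from the discussion. The sketch assumes a tokenizer that walks a String by code point and asks a delegate which code points belong to a token.

```java
import java.util.ArrayList;
import java.util.List;

public class DelegationSketch {

    /** Hypothetical strategy interface: the behavior lives here, not in a tokenizer subclass. */
    interface CodePointFunction {
        boolean isTokenCodePoint(int codePoint);

        /** By default code points are passed through unchanged. */
        default int normalize(int codePoint) {
            return codePoint;
        }
    }

    /** Final from the start, so users cannot change its behavior by subclassing. */
    static final class LowerCaseLetterFunction implements CodePointFunction {
        @Override
        public boolean isTokenCodePoint(int codePoint) {
            // Calls Character.isLetter(int) directly, so supplementary letters are accepted.
            return Character.isLetter(codePoint);
        }

        @Override
        public int normalize(int codePoint) {
            return Character.toLowerCase(codePoint);
        }
    }

    /** Splits text into tokens, delegating every per-code-point decision to fn. */
    static List<String> tokenize(String text, CodePointFunction fn) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads a full code point, including surrogate pairs
            i += Character.charCount(cp);   // advance by 1 or 2 chars
            if (fn.isTokenCodePoint(cp)) {
                current.appendCodePoint(fn.normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The surrogate pair below encodes U+10400, a supplementary uppercase letter.
        String text = "Foo \uD801\uDC00bar";
        List<String> tokens = tokenize(text, new LowerCaseLetterFunction());
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

In this sketch, swapping behavior means passing a different `CodePointFunction` to `tokenize` rather than overriding methods, which is the decoupling argued for above. A char-based API would see the supplementary letter as two separate surrogate chars, neither of which is a letter on its own.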