[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

Simon Willnauer (JIRA) Tue, 29 Dec 2009 10:04:53 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795091#action_12795091 ]


Simon Willnauer commented on LUCENE-2183:
-----------------------------------------

{quote}
#2 is no problem at all; instead, the reflection code to address #1 must be 
implemented with these conditions:

    * A is the class implementing method isTokenChar(int)
    * B is the class implementing method isTokenChar(char)
    * B is a subclass of A
    * A is not CharTokenizer
{quote}
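
For reference, the quoted conditions could be checked roughly as follows. This is only a sketch using plain java.lang.reflect: the classes SketchCharTokenizer, SketchLetterTokenizer and SketchLegacySubclass are hypothetical stand-ins for CharTokenizer, "A" and "B", and nothing here is taken from the attached patch.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for CharTokenizer (not the real Lucene class).
abstract class SketchCharTokenizer {
  protected boolean isTokenChar(char c) { throw new UnsupportedOperationException(); }
  protected boolean isTokenChar(int c) { throw new UnsupportedOperationException(); }
}

// "A": implements the new int-based method.
class SketchLetterTokenizer extends SketchCharTokenizer {
  @Override protected boolean isTokenChar(int c) { return Character.isLetter(c); }
}

// "B": a subclass of A that only overrides the old char-based method.
class SketchLegacySubclass extends SketchLetterTokenizer {
  @Override protected boolean isTokenChar(char c) { return Character.isLetter(c); }
}

final class ReflectionCheck {
  // Walks up the hierarchy and returns the most specific class that
  // declares isTokenChar(paramType), or null if none does.
  static Class<?> declaringClass(Class<?> start, Class<?> paramType) {
    for (Class<?> c = start; c != null; c = c.getSuperclass()) {
      try {
        Method m = c.getDeclaredMethod("isTokenChar", paramType);
        return m.getDeclaringClass();
      } catch (NoSuchMethodException e) {
        // not declared on this class; keep walking up
      }
    }
    return null;
  }

  // True when the four quoted conditions hold for the given implementation.
  static boolean matchesQuotedConditions(Class<? extends SketchCharTokenizer> impl) {
    Class<?> a = declaringClass(impl, int.class);   // implements isTokenChar(int)
    Class<?> b = declaringClass(impl, char.class);  // implements isTokenChar(char)
    return a != null && b != null
        && a != SketchCharTokenizer.class           // A is not CharTokenizer
        && a != b && a.isAssignableFrom(b);         // B is a strict subclass of A
  }

  public static void main(String[] args) {
    System.out.println(matchesQuotedConditions(SketchLegacySubclass.class));  // true
    System.out.println(matchesQuotedConditions(SketchLetterTokenizer.class)); // false
  }
}
```

A check like this only tells us which overload is more specific; it says nothing about what the subclass author intended.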

OK, here is a scenario:
{code}
class MySmartDeseretTokenizer extends LetterTokenizer {

  // old char-based API
  public boolean isTokenChar(char c) {
    // we trust that Deseret high/low surrogates are never unpaired
    return super.isTokenChar(c) || isDeseretHighLowSurrogate(c);
  }

  public char normalize(char c) {
    if (isDeseretHighSurrogate(c))
      return c;
    if (isDeseretLowSurrogate(c))
      return lowerCaseDeseret('\ud801', c)[1];
    return Character.toLowerCase(c);
  }

  // new code-point-based API
  public int normalize(int c) {
    return Character.toLowerCase(c);
  }
}
{code}

If somebody has code like this, they might want to preserve compatibility 
because they run different versions of their app. The old app only supports 
Deseret high surrogates, but the new one would accept all supplementary letter 
characters due to super.isTokenChar(int). This scenario would break our 
reflection solution, and users might be disappointed, since the new API is 
there to bring Unicode support. I'm not saying this scenario exists, but it 
could be a valid one for a very special use case. 
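
To make the difference between the two paths concrete, here is a small sketch using only java.lang.Character. The class SupplementaryDemo and its method names are made up for illustration; U+10400 is DESERET CAPITAL LETTER LONG I, which Java stores as a surrogate pair.

```java
final class SupplementaryDemo {
  static final int DESERET_CAPITAL_LONG_I = 0x10400;

  // char-based view: every UTF-16 code unit is tested on its own,
  // so the two halves of a surrogate pair are never seen as a letter.
  static int countLetterChars(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      if (Character.isLetter(s.charAt(i))) count++;
    }
    return count;
  }

  // code-point-based view: surrogate pairs are combined before testing.
  static int countLetterCodePoints(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      if (Character.isLetter(cp)) count++;
      i += Character.charCount(cp);
    }
    return count;
  }

  public static void main(String[] args) {
    String s = new String(Character.toChars(DESERET_CAPITAL_LONG_I));
    System.out.println(s.length());               // 2 (one code point, two chars)
    System.out.println(countLetterChars(s));      // 0
    System.out.println(countLetterCodePoints(s)); // 1
    System.out.println(Integer.toHexString(
        Character.toLowerCase(DESERET_CAPITAL_LONG_I))); // 10428
  }
}
```

So a tokenizer that silently switches from the char path to the int path starts accepting and lowercasing characters that the old code never touched.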

I'm not saying my proposal is THE way to go, but I really don't want to use 
reflection; this would make things worse IMO. 
Let's find a solution that fits all scenarios.

bq. in the design you propose under the new api, subclassing is impossible, 
which I am not sure I like either.

Hmm, that is not true. You can still subclass and pass your implementation up 
to the superclass. I haven't implemented that yet, but it is definitely possible.
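
One way "passing your impl up to the superclass" could look is sketched below. All names here (CodePointFunctions, DelegatingTokenizer, LowerCaseLetterTokenizer) are hypothetical and assumed purely for illustration; this is not the API from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; name and shape are assumptions.
interface CodePointFunctions {
  boolean isTokenChar(int codePoint);
  int normalize(int codePoint);
}

// Hypothetical base class: the tokenizing loop is fixed, the character
// logic is handed in through the constructor.
class DelegatingTokenizer {
  private final CodePointFunctions functions;

  protected DelegatingTokenizer(CodePointFunctions functions) {
    this.functions = functions;
  }

  public final List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);        // reads a full code point
      i += Character.charCount(cp);
      if (functions.isTokenChar(cp)) {
        current.appendCodePoint(functions.normalize(cp));
      } else if (current.length() > 0) {    // token boundary reached
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) tokens.add(current.toString());
    return tokens;
  }
}

// A subclass that passes its own implementation up to the superclass.
class LowerCaseLetterTokenizer extends DelegatingTokenizer {
  LowerCaseLetterTokenizer() {
    super(new CodePointFunctions() {
      public boolean isTokenChar(int cp) { return Character.isLetter(cp); }
      public int normalize(int cp) { return Character.toLowerCase(cp); }
    });
  }

  public static void main(String[] args) {
    System.out.println(new LowerCaseLetterTokenizer().tokenize("Hello, World"));
    // prints [hello, world]
  }
}
```

With this shape, subclassing stays possible, and the base class never has to guess via reflection which overload the user meant.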

> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a 
> character level. Yet, those tokenizers still use char primitives instead of 
> int codepoints. CharTokenizer should operate on codepoints and preserve 
> backwards compatibility. 

