I'm working on SOLR-822 and trying to introduce new classes CharStream, CharReader and CharFilter into Solr:
CharFilter - normalize characters before tokenizer
https://issues.apache.org/jira/browse/SOLR-822

CharFilter(s) will be placed between the Reader and the Tokenizer:

  // CharReader is needed to convert a Reader to a CharStream
  TokenStream stream =
      new MyTokenFilter(
          new MyTokenizer(
              new MyCharFilter(
                  new CharReader( reader ) ) ) );

A CharFilter does character-level filtering, just as a TokenFilter does token-level filtering.

I attached a JPEG sample of "character normalization" to SOLR-822. Please see:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

As you can see, if you use a CharFilter, token offsets can become incorrect, because a CharFilter may convert one char into two chars, or the other way around. To handle this, CharStream declares an abstract method correctOffset(), which CharFilter implements (CharFilter extends CharStream; see SOLR-822 for the details), so that a Tokenizer can correct its token offsets. But a Tokenizer has to be "CharStream aware" to call that method (a rough sketch of what this could look like is appended below).

What do folks feel about introducing CharFilter into Lucene and changing *all* Tokenizers into "CharStream aware" Tokenizers in Lucene 2.9/3.0?

Thank you,

Koji
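Appendix: a minimal sketch of a "CharStream aware" whitespace tokenizer. The class name and the whitespace-splitting logic are made up for illustration only; CharStream, CharReader and correctOffset() are the pieces proposed in the SOLR-822 patch, and I'm assuming here that CharStream extends Reader and adds int correctOffset(int offset):

  import java.io.IOException;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.Tokenizer;

  /**
   * Sketch only: a whitespace tokenizer that reads from a CharStream
   * (the class proposed in SOLR-822) instead of a plain Reader, and
   * uses correctOffset() to map token offsets back to the original text.
   */
  public class MyCharStreamAwareTokenizer extends Tokenizer {

    private final CharStream in;  // from the SOLR-822 patch
    private int offset = 0;       // position in the *filtered* character stream

    public MyCharStreamAwareTokenizer(CharStream in) {
      super(in);                  // assumes CharStream extends Reader
      this.in = in;
    }

    public Token next() throws IOException {
      // skip leading whitespace
      int c = in.read();
      while (c != -1 && Character.isWhitespace((char) c)) {
        offset++;
        c = in.read();
      }
      if (c == -1) return null;

      int start = offset;
      StringBuilder sb = new StringBuilder();
      while (c != -1 && !Character.isWhitespace((char) c)) {
        sb.append((char) c);
        offset++;
        c = in.read();
      }
      if (c != -1) offset++;      // account for the whitespace just consumed

      int end = start + sb.length();

      // The important part: offsets seen by the Tokenizer refer to the
      // filtered stream, so ask the CharStream to correct them back to
      // offsets in the original Reader before putting them on the Token.
      return new Token(sb.toString(),
                       in.correctOffset(start),
                       in.correctOffset(end));
    }
  }

It would be wired up exactly like the chain above, e.g. new MyCharStreamAwareTokenizer( new MyCharFilter( new CharReader( reader ) ) ), and a Tokenizer given a plain Reader would simply get identity offsets via CharReader.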