This looks like a good idea, thanks!

If a given Tokenizer does not need to do any character normalization (I would think most wouldn't), is there any added cost during tokenization with this change?

Mike

Koji Sekiguchi wrote:

I'm working on SOLR-822 and trying to introduce new classes CharStream,
CharReader and CharFilter into Solr:

CharFilter - normalize characters before tokenizer
https://issues.apache.org/jira/browse/SOLR-822

CharFilter(s) will be placed between Reader and Tokenizer:

// CharReader is needed to convert Reader to CharStream
TokenStream stream = new MyTokenFilter( new MyTokenizer(
new MyCharFilter( new CharReader( reader ) ) ) );

and it does character-level filtering, just as a TokenFilter does
token-level filtering.
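To make the analogy concrete, here is a minimal, self-contained sketch of such a character-level filter (this is not the SOLR-822 code; the class name and the full-width folding rule are purely illustrative) that rewrites characters before a tokenizer ever sees them:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical character-level filter: folds full-width Latin letters
// (U+FF21..U+FF5A) to their ASCII equivalents as they are read.
class FullWidthFoldingFilter extends FilterReader {
    FullWidthFoldingFilter(Reader in) { super(in); }

    private static char fold(char c) {
        if (c >= '\uFF21' && c <= '\uFF3A') return (char) (c - '\uFF21' + 'A');
        if (c >= '\uFF41' && c <= '\uFF5A') return (char) (c - '\uFF41' + 'a');
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fold((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) buf[off + i] = fold(buf[off + i]);
        return n;
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        // A tokenizer reading from this filter sees only ASCII letters.
        Reader r = new FullWidthFoldingFilter(new StringReader("ＡＢＣ abc"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        System.out.println(sb); // prints "ABC abc"
    }
}
```

A tokenizer wrapped around this reader would emit "ABC" and "abc" as tokens, exactly as if the input had been plain ASCII.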

I attached a JPEG sample illustrating "character normalization" to SOLR-822.
Please see:

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

As you can see, if you use a CharFilter, token offsets can become
incorrect, because CharFilters may convert one character to two, or
the other way around. So CharFilter has a method, correctOffset()
(CharStream declares the method as abstract, and CharFilter extends
CharStream; see SOLR-822 for details), so that a Tokenizer can correct
token offsets. But a Tokenizer must be "CharStream aware" to call the
method. How do folks feel about introducing CharFilter into Lucene
and changing *all* Tokenizers into "CharStream aware" Tokenizers in
Lucene 2.9/3.0?
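To illustrate the offset-correction idea (a sketch of the general technique only; the class and field names below are mine, not the patch's API), a filter that deletes characters can record, for each position in the filtered text, how many characters were removed before it. correctOffset() then maps an offset in the filtered stream back to the original input:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative offset-correcting filter: strips soft hyphens (U+00AD)
// and remembers the cumulative number of removed characters so that
// offsets in the filtered text can be mapped back to the raw input.
class SoftHyphenStripper {
    private final String filtered;
    private final List<Integer> removedBefore = new ArrayList<>();

    SoftHyphenStripper(String input) {
        StringBuilder out = new StringBuilder();
        int removed = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u00AD') { removed++; continue; }
            removedBefore.add(removed);
            out.append(c);
        }
        removedBefore.add(removed); // end-of-stream offset
        filtered = out.toString();
    }

    String filtered() { return filtered; }

    // Offset in the filtered text -> offset in the original input,
    // so a "CharStream aware" tokenizer can report correct offsets.
    int correctOffset(int offset) { return offset + removedBefore.get(offset); }
}

public class OffsetDemo {
    public static void main(String[] args) {
        SoftHyphenStripper f = new SoftHyphenStripper("to\u00ADken");
        System.out.println(f.filtered());       // prints "token"
        System.out.println(f.correctOffset(0)); // prints 0
        System.out.println(f.correctOffset(5)); // prints 6: end of "token" in the raw input
    }
}
```

The same bookkeeping works in the other direction (one character expanded to two) with negative corrections, which is why the tokenizer, not the filter, has to apply correctOffset() when it records token start/end offsets.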

Thank you,

Koji


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


