This looks like a good idea, thanks!

If a given Tokenizer does not need to do any character normalization (I would think most wouldn't), is there any added cost during tokenization with this change?

Mike

Koji Sekiguchi wrote:

I'm working on SOLR-822 and trying to introduce new classes CharStream,
CharReader and CharFilter into Solr:

CharFilter - normalize characters before tokenizer
https://issues.apache.org/jira/browse/SOLR-822

CharFilter(s) will be placed between Reader and Tokenizer:

// CharReader is needed to convert Reader to CharStream
TokenStream stream = new MyTokenFilter( new MyTokenizer(
new MyCharFilter( new CharReader( reader ) ) ) );

and it does character-level filtering, just as a TokenFilter does
token-level filtering.
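To make the analogy concrete, here is a minimal, self-contained sketch of such a character-level filter (this is not the SOLR-822 code; the class name and the full-width folding rule are purely illustrative) that rewrites characters before a tokenizer ever sees them:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical character-level filter: folds full-width Latin letters
// (U+FF21..U+FF5A) to their ASCII equivalents as they are read.
class FullWidthFoldingFilter extends FilterReader {
    FullWidthFoldingFilter(Reader in) { super(in); }

    private static char fold(char c) {
        if (c >= '\uFF21' && c <= '\uFF3A') return (char) (c - '\uFF21' + 'A');
        if (c >= '\uFF41' && c <= '\uFF5A') return (char) (c - '\uFF41' + 'a');
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fold((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) buf[off + i] = fold(buf[off + i]);
        return n;
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        // A tokenizer reading from this filter sees only ASCII letters.
        Reader r = new FullWidthFoldingFilter(new StringReader("ＡＢＣ abc"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        System.out.println(sb); // prints "ABC abc"
    }
}
```

A tokenizer wrapped around this reader would emit "ABC" and "abc" as tokens, exactly as if the input had been plain ASCII.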

I attached a JPEG sample illustrating "character normalization" to SOLR-822.
Please see:

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

As you can see, if you use a CharFilter, token offsets can become
incorrect, because CharFilters may convert one character to two, or
the other way around. So CharFilter has a method, correctOffset()
(CharStream declares the method as abstract, and CharFilter extends
CharStream; see SOLR-822 for details), so that a Tokenizer can correct
token offsets. But a Tokenizer must be "CharStream aware" to call the
method. How do folks feel about introducing CharFilter into Lucene
and changing *all* Tokenizers into "CharStream aware" Tokenizers in
Lucene 2.9/3.0?
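To illustrate the offset-correction idea (a sketch of the general technique only; the class and field names below are mine, not the patch's API), a filter that deletes characters can record, for each position in the filtered text, how many characters were removed before it. correctOffset() then maps an offset in the filtered stream back to the original input:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative offset-correcting filter: strips soft hyphens (U+00AD)
// and remembers the cumulative number of removed characters so that
// offsets in the filtered text can be mapped back to the raw input.
class SoftHyphenStripper {
    private final String filtered;
    private final List<Integer> removedBefore = new ArrayList<>();

    SoftHyphenStripper(String input) {
        StringBuilder out = new StringBuilder();
        int removed = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u00AD') { removed++; continue; }
            removedBefore.add(removed);
            out.append(c);
        }
        removedBefore.add(removed); // end-of-stream offset
        filtered = out.toString();
    }

    String filtered() { return filtered; }

    // Offset in the filtered text -> offset in the original input,
    // so a "CharStream aware" tokenizer can report correct offsets.
    int correctOffset(int offset) { return offset + removedBefore.get(offset); }
}

public class OffsetDemo {
    public static void main(String[] args) {
        SoftHyphenStripper f = new SoftHyphenStripper("to\u00ADken");
        System.out.println(f.filtered());       // prints "token"
        System.out.println(f.correctOffset(0)); // prints 0
        System.out.println(f.correctOffset(5)); // prints 6: end of "token" in the raw input
    }
}
```

The same bookkeeping works in the other direction (one character expanded to two) with negative corrections, which is why the tokenizer, not the filter, has to apply correctOffset() when it records token start/end offsets.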

Thank you,

Koji


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


