: > If a given Tokenizer does not need to do any character normalization (I
: would think most wouldn't) is there any added cost during tokenization with
: this change?
:
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().
But every tokenizer *should* call correctOffset on the start/end offset of
every token it produces, correct?
My understanding is that we would make a change like this:
1) change the Tokenizer class to look something like this...
  public abstract class Tokenizer extends TokenStream {
    protected CharStream input;

    protected Tokenizer() {}

    protected Tokenizer(Reader input) {
      this(new NoOpCharStream(input));
    }

    protected Tokenizer(CharStream input) {
      this.input = input;
    }

    public void close() throws IOException {
      if (input != null) {  // the no-arg constructor leaves input null
        input.close();
      }
    }

    public void reset(Reader input) throws IOException {
      if (input instanceof CharStream) {
        this.input = (CharStream) input;
      } else {
        this.input = new NoOpCharStream(input);
      }
    }
  }
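For reference, here is a minimal sketch of what that NoOpCharStream could
look like -- the CharStream base class and the NoOpCharStream name are
assumptions from the sketch above, not existing Lucene classes, so treat
the shapes as illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Assumed shape of the CharStream base class: a Reader that can map an
// offset in the (possibly transformed) stream back to an offset in the
// original input.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Hypothetical identity wrapper used by the plain-Reader constructor:
// it delegates all reads and performs no offset correction.
final class NoOpCharStream extends CharStream {
    private final Reader input;

    NoOpCharStream(Reader input) { this.input = input; }

    // Identity function: offsets in the stream are already correct.
    public final int correctOffset(int currentOff) { return currentOff; }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return input.read(cbuf, off, len);
    }

    public void close() throws IOException { input.close(); }
}

public class NoOpCharStreamDemo {
    public static void main(String[] args) throws IOException {
        CharStream cs = new NoOpCharStream(new StringReader("hello"));
        char[] buf = new char[5];
        int n = cs.read(buf, 0, 5);
        // prints: 5 hello 3
        System.out.println(n + " " + new String(buf, 0, n) + " " + cs.correctOffset(3));
        cs.close();
    }
}
```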
2) change all of the Tokenizers shipped with Lucene to use correctOffset
when setting all start/end offsets on any Tokens.
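To illustrate why step 2 matters, here is a toy CharStream that strips '-'
characters and remembers where they were, so correctOffset can map stream
offsets back to original-text offsets.  All class names here are made up
for the example, not Lucene's actual classes:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Assumed base class, as in the sketch above.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Hypothetical correcting stream: strips '-' and records, for each
// stream offset, the offset it came from in the original text.
final class StripDashesCharStream extends CharStream {
    private final String filtered;
    private final int[] correct;  // correct[streamOff] = original offset
    private int pos = 0;

    StripDashesCharStream(String original) {
        StringBuilder sb = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c != '-') { sb.append(c); map.add(i); }
        }
        map.add(original.length());  // end-of-stream maps to end of input
        filtered = sb.toString();
        correct = new int[map.size()];
        for (int i = 0; i < correct.length; i++) correct[i] = map.get(i);
    }

    public int correctOffset(int currentOff) {
        return correct[Math.min(currentOff, correct.length - 1)];
    }

    public int read(char[] cbuf, int off, int len) {
        if (pos >= filtered.length()) return -1;
        int n = Math.min(len, filtered.length() - pos);
        filtered.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    public void close() {}
}

public class OffsetDemo {
    public static void main(String[] args) throws IOException {
        CharStream in = new StripDashesCharStream("foo-bar baz");
        // A tokenizer reading this stream sees "foobar baz", so it finds
        // the token "foobar" at stream offsets [0,6).  Only by calling
        // correctOffset on both ends does it get offsets into the
        // original text, [0,7), which span the stripped dash.
        System.out.println(in.correctOffset(0) + "," + in.correctOffset(6));  // prints: 0,7
        in.close();
    }
}
```

A tokenizer that stored the raw stream offsets 0 and 6 would point at the
wrong characters in the original text, which is exactly the bug step 2 is
meant to prevent.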
...once those two things are done, anyone using out-of-the-box tokenizers
can use a CharStream and get correct offsets -- anyone with an existing
custom Tokenizer should continue to get the same behavior as before, but
if they want to start using a CharStream they need to tweak their code.
The only potential downside i can think of is the performance cost of the
added method calls -- but if we make NoOpCharStream.correctOffset final
the JVM should be able to optimize away the "identity" function, correct?
-Hoss