I'm curious about some things in the ThaiAnalyzer.
It has:
  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new StandardTokenizer(matchVersion, reader);
      streams.result = new StandardFilter(streams.source);
      streams.result = new ThaiWordFilter(streams.result);
      streams.result = new StopFilter(
          StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
          streams.result,
          StopAnalyzer.ENGLISH_STOP_WORDS_SET);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);
      streams.result.reset(); // reset the ThaiWordFilter's state
    }
    return streams.result;
  }
I'm really curious why reusableTokenStream has the block:
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
but almost no other Analyzer in contrib has it (none that I have seen).
Shouldn't it be in all of them?
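
To make the question concrete, here is a minimal sketch of how such a guard could work. It is not the Lucene code: BaseSketchAnalyzer, CustomSketchAnalyzer and subclassOverridesTokenStream are hypothetical names, the methods return strings instead of TokenStreams, and the reflection check is only an assumption about how a flag like overridesTokenStreamMethod might be computed.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Method;

/** Hypothetical stand-in for an analyzer such as ThaiAnalyzer; not Lucene code. */
class BaseSketchAnalyzer {
    private final boolean overridesTokenStreamMethod;

    BaseSketchAnalyzer() {
        // Decide once, at construction time, whether a subclass replaced tokenStream().
        overridesTokenStreamMethod =
            subclassOverridesTokenStream(getClass(), BaseSketchAnalyzer.class);
    }

    /** True if tokenStream(String, Reader) is declared by some class other than base. */
    static boolean subclassOverridesTokenStream(Class<?> actual, Class<?> base) {
        try {
            Method m = actual.getMethod("tokenStream", String.class, Reader.class);
            return m.getDeclaringClass() != base;
        } catch (NoSuchMethodException e) {
            return false; // no public tokenStream at all, so nothing was overridden
        }
    }

    public String tokenStream(String fieldName, Reader reader) {
        return "base chain (fresh)";
    }

    public String reusableTokenStream(String fieldName, Reader reader) {
        if (overridesTokenStreamMethod) {
            // Fall back so the subclass's custom chain is not silently bypassed.
            return tokenStream(fieldName, reader);
        }
        return "base chain (reused)";
    }
}

/** Overrides tokenStream() but not reusableTokenStream(). */
class CustomSketchAnalyzer extends BaseSketchAnalyzer {
    @Override
    public String tokenStream(String fieldName, Reader reader) {
        return "custom chain";
    }
}

class OverrideCheckDemo {
    public static void main(String[] args) {
        Reader r = new StringReader("text");
        System.out.println(new BaseSketchAnalyzer().reusableTokenStream("f", r));
        System.out.println(new CustomSketchAnalyzer().reusableTokenStream("f", r));
    }
}
```

Without the guard, the subclass in this sketch would get the base chain back from reusableTokenStream() and its own tokenStream() would never be called. The same would seem to apply to any analyzer that can be subclassed.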
I'm also curious about:
    streams.source.reset(reader);
    streams.result.reset(); // reset the ThaiWordFilter's state
This resets the whole chain, from the tokenizer at the bottom to the last
filter at the top.
Most other Analyzer implementations just have
    streams.source.reset(reader);
It seems to me that calling only streams.source.reset(reader) presumes that the
chain needs to be reset only at the tokenizer, i.e. that no filter in the chain
keeps state between documents.
The documentation for reset() does not say that an implementation should always
call super.reset() or input.reset(), which is necessary for the reset to chain
all the way back to the tokenizer.
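
A small sketch of the failure mode I have in mind. Everything here is hypothetical (SketchStream, SketchTokenizer and BufferingFilter are not Lucene classes); the filter simply buffers the parts of a hyphenated token, as a rough analogue of a filter that keeps state between calls.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical minimal stream; not the Lucene TokenStream API. */
abstract class SketchStream {
    abstract String next();   // returns null at end of input
    void reset() {}           // default: no state to reset
}

/** Splits text on spaces; reset(String) points it at a new document. */
class SketchTokenizer extends SketchStream {
    private Iterator<String> words;

    SketchTokenizer(String text) { reset(text); }

    void reset(String text) {
        words = Arrays.asList(text.split(" ")).iterator();
    }

    @Override
    String next() { return words.hasNext() ? words.next() : null; }
}

/** Splits tokens on '-' and buffers the parts, so it carries state between calls. */
class BufferingFilter extends SketchStream {
    private final SketchStream input;
    private final Deque<String> pending = new ArrayDeque<>();

    BufferingFilter(SketchStream input) { this.input = input; }

    @Override
    String next() {
        while (pending.isEmpty()) {
            String tok = input.next();
            if (tok == null) return null;
            pending.addAll(Arrays.asList(tok.split("-")));
        }
        return pending.poll();
    }

    @Override
    void reset() {
        input.reset();    // chain the reset back toward the tokenizer
        pending.clear();  // drop parts left over from the previous document
    }
}

class ResetChainDemo {
    public static void main(String[] args) {
        SketchTokenizer source = new SketchTokenizer("a-b c");
        BufferingFilter result = new BufferingFilter(source);
        result.next();                      // consumes "a"; "b" stays buffered

        source.reset("x y");                // reset at the tokenizer only
        System.out.println(result.next());  // stale token from the old document

        source.reset("x y");
        result.reset();                     // reset the filter chain as well
        System.out.println(result.next());  // first token of the new document
    }
}
```

If the previous document was not consumed to the end, resetting only the source lets a leftover part leak into the next document. Resetting the result as well avoids that, provided every filter's reset() passes the call on to its input.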
If we go to a declarative model for an analyzer, I would think that one would
always want to do both.
-- DM