> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> Added: src/java/org/apache/lucene/analysis NullAnalyzer.java
> NullTokenizer.java
> Log:
> added NullTokenizer/NullAnalyzer which just
> pass through space-separated tokens unmodified (mostly for
> testing purposes)
NullTokenizer is almost exactly like LetterTokenizer, except that instead of
checking for Character.isLetter it checks for !Character.isWhitespace.
Perhaps we should make both of these subclasses of a common base class, with
a protected isTokenChar method that each implements? It's a shame to have
so much code duplication.
We also have LowerCaseTokenizer, which you wrote and which is almost the
same code again. Maybe the base class could also have a normalizeCharacter
method that in LetterTokenizer does nothing but in LowerCaseTokenizer calls
toLowerCase.
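A minimal, standalone sketch of that base-class idea follows. It does not use
Lucene's real Tokenizer/Token API: the next() method returning a String, the
constructor shapes, and the TokenizerDemo helper are all assumptions made so
the example runs on its own. Only the names isTokenChar, normalizeCharacter,
CharTokenizer, LetterTokenizer, LowerCaseTokenizer and WhitespaceTokenizer
come from this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the proposed common base class (NOT Lucene's actual API).
abstract class CharTokenizer {
    private final Reader input;

    CharTokenizer(Reader input) {
        this.input = input;
    }

    /** Returns true if c belongs inside a token. */
    protected abstract boolean isTokenChar(char c);

    /** Per-character hook; the default leaves c unchanged. */
    protected char normalizeCharacter(char c) {
        return c;
    }

    /** Returns the next token, or null at end of input (assumed shape). */
    public String next() throws IOException {
        StringBuilder token = new StringBuilder();
        int ch;
        while ((ch = input.read()) != -1) {
            char c = (char) ch;
            if (isTokenChar(c)) {
                token.append(normalizeCharacter(c));
            } else if (token.length() > 0) {
                break; // a non-token char ends the current token
            }
            // non-token chars before a token starts are skipped
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

class LetterTokenizer extends CharTokenizer {
    LetterTokenizer(Reader input) { super(input); }

    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}

class LowerCaseTokenizer extends LetterTokenizer {
    LowerCaseTokenizer(Reader input) { super(input); }

    protected char normalizeCharacter(char c) {
        return Character.toLowerCase(c);
    }
}

class WhitespaceTokenizer extends CharTokenizer {
    WhitespaceTokenizer(Reader input) { super(input); }

    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }
}

// Hypothetical helper, only here to exercise the sketch.
class TokenizerDemo {
    static List<String> tokens(CharTokenizer tokenizer) {
        List<String> out = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                out.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, World 42";
        System.out.println(tokens(new LetterTokenizer(new StringReader(text))));
        System.out.println(tokens(new LowerCaseTokenizer(new StringReader(text))));
        System.out.println(tokens(new WhitespaceTokenizer(new StringReader(text))));
    }
}
```

With this shape each subclass is reduced to one overridden method, and the
scanning loop lives in a single place.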
NullTokenizer and NullAnalyzer are also not very descriptive names. I would
prefer WhitespaceTokenizer and WhitespaceAnalyzer. But if these are really
only used by the test code, and the above base-class strategy were
implemented, then these could just become an anonymous class like:
  Analyzer analyzer = new Analyzer() {
    public TokenStream tokenStream(Reader reader) {
      return new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) {
          return !Character.isWhitespace(c);
        }
      };
    }
  };
That way org.apache.lucene.analysis wouldn't be cluttered by classes not of
general interest.
Do you agree with this proposal? If so, would you like to implement it, or
shall I?
Doug