> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
>   Added:       src/java/org/apache/lucene/analysis NullAnalyzer.java
>                         NullTokenizer.java
>   Log:
>   added NullTokenizer/NullAnalyzer which just pass through
>   space-separated tokens unmodified (mostly for testing purposes)

NullTokenizer is almost exactly like LetterTokenizer, except that instead of
checking for Character.isLetter it checks for !Character.isWhitespace.
Perhaps we should make both of these subclasses of a common base class, with
a protected isTokenChar method that each implements?  It's a shame to have
so much code duplication.

We also have LowerCaseTokenizer, which you wrote and which is almost the
same code again.  Maybe the base class could also have a normalizeCharacter
method that in LetterTokenizer does nothing but in LowerCaseTokenizer calls
Character.toLowerCase.
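
To make the idea concrete, here is a rough, self-contained sketch of what
such a base class might look like.  It is only an illustration, not the
actual Lucene code: the wrapper class CharTokenizerSketch, the String-returning
next() method and the tokens() helper are all assumptions made so the example
runs on its own (the real TokenStream deals in Token objects, not strings).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper so the sketch compiles as a single file.
class CharTokenizerSketch {

  /** Assumed base class: owns the scanning loop the tokenizers share. */
  static abstract class CharTokenizer {
    private final Reader input;

    CharTokenizer(Reader input) {
      this.input = input;
    }

    /** Returns true if c belongs inside a token. */
    protected abstract boolean isTokenChar(char c);

    /** Per-character hook; the default leaves the character unchanged. */
    protected char normalizeCharacter(char c) {
      return c;
    }

    /** Returns the text of the next token, or null at end of input. */
    public final String next() throws IOException {
      StringBuilder buffer = new StringBuilder();
      int ch;
      while ((ch = input.read()) != -1) {
        char c = (char) ch;
        if (isTokenChar(c)) {
          buffer.append(normalizeCharacter(c));
        } else if (buffer.length() > 0) {
          break; // a non-token character ends the current token
        }
        // otherwise: skip separators that precede the token
      }
      return buffer.length() > 0 ? buffer.toString() : null;
    }
  }

  /** Tokens are maximal runs of letters. */
  static class LetterTokenizer extends CharTokenizer {
    LetterTokenizer(Reader input) { super(input); }
    protected boolean isTokenChar(char c) { return Character.isLetter(c); }
  }

  /** Same as LetterTokenizer, but lower-cases each character. */
  static class LowerCaseTokenizer extends LetterTokenizer {
    LowerCaseTokenizer(Reader input) { super(input); }
    protected char normalizeCharacter(char c) { return Character.toLowerCase(c); }
  }

  /** Tokens are maximal runs of non-whitespace characters. */
  static class WhitespaceTokenizer extends CharTokenizer {
    WhitespaceTokenizer(Reader input) { super(input); }
    protected boolean isTokenChar(char c) { return !Character.isWhitespace(c); }
  }

  /** Hypothetical helper: drains a tokenizer into a list. */
  static List<String> tokens(CharTokenizer tokenizer) {
    List<String> result = new ArrayList<>();
    try {
      for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
        result.add(t);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "Hello, World 42";
    System.out.println(tokens(new LetterTokenizer(new StringReader(text))));
    System.out.println(tokens(new LowerCaseTokenizer(new StringReader(text))));
    System.out.println(tokens(new WhitespaceTokenizer(new StringReader(text))));
  }
}
```

With this shape each concrete tokenizer shrinks to one or two overridden
methods, and the scanning loop lives in exactly one place.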

NullTokenizer and NullAnalyzer are also not very descriptive names.  I would
prefer WhitespaceTokenizer and WhitespaceAnalyzer.  But if these are really
only used by the test code, and the above base-class strategy were
implemented, then these could just become an anonymous class, like:
  Analyzer analyzer = new Analyzer() {
    public TokenStream tokenStream(Reader reader) {
      return new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) {
          return !Character.isWhitespace(c);
        }
      };
    }
  };
That way org.apache.lucene.analysis wouldn't be cluttered by classes not of
general interest.

Do you agree with this proposal?  If so, would you like to implement it, or
shall I?

Doug
