Hi,

Given the following code, running under Lucene 3.0.1:
8<------------------------------------------------------------------------------
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

    // StandardTokenizer followed by the English stop word filter.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(
            true,
            new StandardTokenizer(Version.LUCENE_30, reader),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET
        );
    }

    // Prints the tokens produced for the given string, separated by spaces.
    private static void printTokens(String string) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("default", new StringReader(string));
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(termAtt.term());
            System.out.print(" ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        printTokens("one_two_three");        // prints "one two three"
        printTokens("four4_five5_six6");     // prints "four4_five5_six6"
        printTokens("seven7_eight_nine");    // prints "seven7_eight nine"
        printTokens("ten_eleven11_twelve");  // prints "ten_eleven11_twelve"
    }
}
8<------------------------------------------------------------------------------

I can understand why "one_two_three" and "four4_five5_six6" are tokenized the
way they are, since this is explained in the StandardTokenizer class header
Javadoc. The other two cases are more subtle, and I'm not sure I follow the
logic. If the "7" after "seven" joins it into a single token with "eight" but
leaves "nine" separate, why is "ten" glued to "eleven11"?

Is there any standard and/or easy way to make StandardTokenizer always split on
the underscore?

Thanks in advance.

m.
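P.S. One workaround I am considering, in case there is no built-in option: replace every underscore with a space before the text reaches the tokenizer. Below is a minimal sketch using only java.io. The class name UnderscoreToSpaceReader and the helper readAll are my own inventions, not part of Lucene. I assume it could be plugged in as new StandardTokenizer(Version.LUCENE_30, new UnderscoreToSpaceReader(reader)), but I have not verified how the tokenizer then behaves. Since the replacement is one character for one character, token offsets should still line up with the original text.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not a Lucene class): a Reader wrapper that turns
// every '_' into ' ' so that a downstream tokenizer sees whitespace instead.
class UnderscoreToSpaceReader extends FilterReader {

    UnderscoreToSpaceReader(Reader in) {
        super(in);
    }

    // Single-character read: map '_' to ' ', pass everything else through
    // (including the end-of-stream value -1).
    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '_' ? ' ' : c;
    }

    // Bulk read: let the wrapped reader fill the buffer, then patch the
    // characters that were just read in place.
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) { // no iterations when n == -1
            if (buf[i] == '_') {
                buf[i] = ' ';
            }
        }
        return n;
    }

    // Illustration-only helper: drains a Reader into a String.
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader r = new UnderscoreToSpaceReader(new StringReader("ten_eleven11_twelve"));
        System.out.println(readAll(r)); // prints "ten eleven11 twelve"
    }
}
```

The obvious drawback is that this discards underscores everywhere, including in identifiers where keeping them might be desirable, so I would still prefer a tokenizer-level setting if one exists.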