On 03/11/2008 at 8:46 AM, André Warnier wrote:
> João Rodrigues wrote:
> > @André:
> >
> > Even if I use Simple Analyzer, which I think should leave the term
> > "alone", the number gets "eaten".
>
> I'm no expert, so I was just launching that answer to see if it elicited
> more qualified responses. But I found this on Google :
> http://project.iml.umu.se/projects/scam-repository/ticket/2 (seems to
> say also that SimpleAnalyser does not retain numbers, and that you
> should try StandardAnalyser instead).
>
> (But I must say that precise documentation seems hard to find).

The API docs are at: <http://lucene.apache.org/java/2_3_1/api/>. Find the
class name you're interested in and follow it where it goes :) .

SimpleAnalyzer is "[a]n Analyzer that filters LetterTokenizer with
LowerCaseFilter":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/SimpleAnalyzer.html>

LetterTokenizer's docs say:

    A LetterTokenizer is a tokenizer that divides text at non-letters.
    That's to say, it defines tokens as maximal strings of adjacent
    letters, as defined by java.lang.Character.isLetter() predicate.

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LetterTokenizer.html>

LowerCaseFilter "[n]ormalizes token text to lower case":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html>

So digits are not letters: SimpleAnalyzer treats them as token boundaries
and they never make it into a term.

Exercise for the reader: find the docs for StandardAnalyzer :) .

Steve
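
To make the LetterTokenizer description concrete, here is a minimal
plain-Java sketch of the rule the docs describe: split at every character
for which java.lang.Character.isLetter() is false, then lower-case what is
left. It is not Lucene code and does not call the Lucene API; the class
name LetterTokenSketch and the method tokenize are made up for
illustration, and the real classes may differ in details such as maximum
token length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the documented behaviour of
// LetterTokenizer followed by LowerCaseFilter, without using Lucene.
class LetterTokenSketch {

    // Returns maximal runs of letters, lower-cased. Anything that is not
    // a letter (digits, spaces, punctuation) only acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [room, in, building, b]: "101" and the "2" in "B2" vanish.
        System.out.println(tokenize("Room 101 in Building B2"));
    }
}
```

Under this rule a purely numeric term produces no token at all, which
matches the "eaten" number reported above.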
