Steven A Rowe wrote:
On 03/11/2008 at 8:46 AM, André Warnier wrote:
João Rodrigues wrote:
@André:

Even if I use Simple Analyzer, which I think should leave the term
"alone", the number gets "eaten".
I'm no expert, so I was just launching that answer to see if it elicited
more qualified responses. But I found this on Google:
http://project.iml.umu.se/projects/scam-repository/ticket/2 (seems to
say also that SimpleAnalyzer does not retain numbers, and that you
should try StandardAnalyzer instead).

(But I must say that precise documentation seems hard to find).

The API docs are at: <http://lucene.apache.org/java/2_3_1/api/>.  Find the 
class name you're interested in and follow it where it goes :) .

SimpleAnalyzer is "[a]n Analyzer that filters LetterTokenizer with 
LowerCaseFilter":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/SimpleAnalyzer.html>

LetterTokenizer's docs say:

   A LetterTokenizer is a tokenizer that divides text at non-letters.
   That's to say, it defines tokens as maximal strings of adjacent
   letters, as defined by java.lang.Character.isLetter() predicate.

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LetterTokenizer.html>

LowerCaseFilter "[n]ormalizes token text to lower case":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html>

Exercise for the reader: find the docs for StandardAnalyzer :) .

http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html
Clever, that.

Thanks for the information above. Since I am myself trying to learn Lucene, this discussion comes in handy.

In other words, SimpleAnalyzer is also not the right tool to use for indexing acronyms such as "P2P" or "W3C".
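
A minimal sketch of why the digit disappears, assuming only the behaviour quoted from the docs above (split at anything for which Character.isLetter() is false, then lower-case). This is plain Java that imitates that rule, not Lucene's own code; the class name LetterSplitSketch and the method tokenize are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {
    // Hypothetical helper: collects maximal runs of letters and lower-cases
    // them, imitating the documented LetterTokenizer + LowerCaseFilter chain.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (digit, space, punctuation) ends the token
                // and is itself thrown away.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("P2P")); // prints [p, p]
        System.out.println(tokenize("W3C")); // prints [w, c]
    }
}
```

Under that rule the "2" in "P2P" acts as a separator, so the acronym is indexed as two separate one-letter tokens and a search for "p2p" has nothing to match.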


For the casual user, the practical problem is that in the doc, in the paragraph
"An Analyzer that filters LetterTokenizer with LowerCaseFilter."
the words "LetterTokenizer" and "LowerCaseFilter" are not themselves links to the corresponding classes (docs). So the casual user has no idea where in the hierarchy to go and look for those. The StandardAnalyzer is a case in point. Moving up the class hierarchy doesn't help that much, since one quickly ends at Object. Entering the URL "http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/" yields a long list of things, among which are the ones looked for, but it can hardly be considered user-friendly.

Now here is a brilliant idea: why not create a public Lucene site, where the docs would be indexed with... Lucene?
Or is there already such a thing?

André
