2011/2/7 Robert Muir <rcm...@gmail.com>

> On Sun, Feb 6, 2011 at 3:28 PM, Georger Araujo <georger.ara...@gmail.com> wrote:
> > Hi,
> > I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
> > to the developers and the community!
> > I'd like to write a custom analyzer whose only difference from
> > org.apache.lucene.analysis.br.BrazilianAnalyzer is that it discards
> > numeric tokens from the input. I've looked at the code and also at the
> > discussion in [1], but I'm lost about what is the simplest/cleanest way to
> > go.
> > What do you think?
>
> Hi, in general the supplied analyzers are basically very general
> purpose examples.
>
> So I would make your own analyzer, except using a tokenizer that
> discards numbers (like LowerCaseTokenizer) instead of
> StandardTokenizer: something like LowerCaseTokenizer +
> BrazilianStemFilter + Brazilian stopwords in a StopFilter.

Hi,

I investigated this issue further and found out that StandardTokenizer is actually desirable for my needs: I need to index emails, acronyms, etc. So I'll use org.apache.lucene.analysis.StopFilter as a starting point to write a custom TokenFilter that discards numbers, then extend BrazilianAnalyzer and use this custom TokenFilter as the final filter in the chain. I believe the end result will be simpler and cleaner this way.

Best regards,
Georger
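
A minimal sketch of the rule such a number-discarding filter might apply, written in plain Java with no Lucene dependency so it runs on its own. The class name NumericTokenDropper, the helper names, and the definition of "numeric" (at least one digit, and nothing but digits, '.' and ',') are assumptions for illustration; they are not part of Lucene or of the thread above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class: models only the keep/drop decision.
public class NumericTokenDropper {

    // Assumption: a token is "numeric" if it contains at least one digit
    // and consists only of digits, '.' and ','.
    // So "2011" and "3,14" are numeric, while "b2b" is not.
    static boolean isNumeric(String token) {
        boolean sawDigit = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                sawDigit = true;
            } else if (c != '.' && c != ',') {
                return false; // any other character means "keep this token"
            }
        }
        return sawDigit;
    }

    // Returns a new list with the numeric tokens removed, order preserved.
    static List<String> dropNumeric(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("pedido", "2011", "b2b", "3,14", "ibm");
        System.out.println(dropNumeric(tokens)); // prints [pedido, b2b, ibm]
    }
}
```

In an actual TokenFilter, the same check would presumably sit inside incrementToken(), which would keep pulling tokens from the wrapped stream until it finds one that is not numeric, much as StopFilter does for stop words. How the term text is read there depends on the Lucene version in use, so that part is left out of the sketch.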