Re: Preserving dots of an acronym while indexing in Lucene

Shai Erera Sat, 18 Jul 2009 23:02:40 -0700

I think you should write your own Analyzer and use:
* StandardTokenizer for tokenization and ACRONYM detection.
* StopFilter for stop word handling.

The Analyzer you write should override tokenStream() and do something like:

************************************************************
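public TokenStream tokenStream(String fieldName, Reader reader) {
  // StandardTokenizer keeps the dots in tokens it detects as acronyms.
  TokenStream result = new StandardTokenizer(reader);
  // Optional: only if lower casing is also what you want.
  result = new LowerCaseFilter(result);
  // stopWords is your own set of stop words.
  result = new StopFilter(result, stopWords);
  return result;
}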
************************************************************

StandardAnalyzer wraps StandardTokenizer with StandardFilter, which strips
the dots from acronyms, so you don't want to use StandardAnalyzer here.
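
To make the difference concrete, here is a small library-free sketch that
mimics the chain on a sample sentence. It does not use Lucene at all: the
regular expression is only a simplified stand-in for StandardTokenizer's
acronym detection, and the class name, the analyze() helper and the
stripAcronymDots flag are made up for illustration. The flag imitates what
StandardFilter would do to the acronym tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymChainSketch {
    // Simplified stand-in for StandardTokenizer (an assumption, not its real grammar):
    // a dotted acronym such as "U.S.A." is one token; any other run of
    // letters or digits is a plain word token.
    private static final Pattern TOKEN =
        Pattern.compile("(?:\\p{L}\\.){2,}|[\\p{L}\\p{N}]+");

    // Tokenize, optionally strip acronym dots (the StandardFilter-like step),
    // lower-case, then drop stop words.
    static List<String> analyze(String text, Set<String> stopWords,
                                boolean stripAcronymDots) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (stripAcronymDots) {
                // Only acronym tokens contain dots, so words are unaffected.
                token = token.replace(".", "");
            }
            token = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords =
            new HashSet<String>(Arrays.asList("the", "and", "a"));
        String text = "The U.S.A. and the U.K. signed a treaty";

        // Chain without the dot-stripping step: dots survive.
        System.out.println(analyze(text, stopWords, false));
        // prints [u.s.a., u.k., signed, treaty]

        // Chain with the dot-stripping step: dots are gone.
        System.out.println(analyze(text, stopWords, true));
        // prints [usa, uk, signed, treaty]
    }
}
```

With real Lucene classes the exact tokens depend on the version you run, so
it is worth printing the tokens your own Analyzer produces before indexing.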

Shai

On Sun, Jul 19, 2009 at 8:53 AM, mitu2009 <[email protected]> wrote:

>
> Hi,
>
> If I want Lucene to preserve the dots of acronyms (for example: U.K., U.S.A.),
> which analyzer do I need to use, and how? I also want to supply a set of stop
> words to Lucene while doing this.
>
> --
> View this message in context:
> http://www.nabble.com/Preserving-dots-of-an-acronym-while-indexing-in-Lucene-tp24554342p24554342.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
