Re: SpellChecker performance and usage

Doron Cohen Mon, 03 Dec 2007 22:02:00 -0800

I haven't had performance issues when using the spell checker.
Can you describe what you tried and how long it took, so
people can relate to that?


AFAIK the spell checker in o.a.l.search.spell does not "expand
a query by adding all the permutations of potentially misspelled
word". It is based on building an auxiliary index whose *documents*
are *words* of the main index, going through n-gram tokenization.
A checked word is tokenized that way too, and used as a query on
the auxiliary index.

There's more wisdom in the query tokenization,
but a simplified example can help to see how it works:
- a misspelled word 'helo' is tokenized as 'he el lo',
- the auxiliary index contains a document for the correct
  word "hello" that was tokenized as 'he el ll lo'
- the score of the document 'hello' would be high when searching
  the auxiliary index for 'he el lo'.
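To make the example concrete, here is a minimal sketch in Python (not
Lucene code). It assumes plain 2-grams and scores a candidate simply by
how many of the query's grams it shares; the names ngrams,
build_aux_index and suggest are made up for illustration, and the real
spell checker does more than this in its tokenization and scoring.

```python
from collections import Counter


def ngrams(word, n=2):
    """Return the character n-grams of a word, e.g. 'helo' -> ['he', 'el', 'lo']."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_aux_index(lexicon, n=2):
    """Toy auxiliary index: map each n-gram to the set of words containing it."""
    index = {}
    for word in lexicon:
        for gram in set(ngrams(word, n)):
            index.setdefault(gram, set()).add(word)
    return index


def suggest(word, index, n=2, limit=5):
    """Rank indexed words by the number of n-grams they share with the query."""
    hits = Counter()
    for gram in set(ngrams(word, n)):
        for candidate in index.get(gram, ()):
            hits[candidate] += 1
    # Most shared grams first; ties broken alphabetically for stable output.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:limit]]


# Hypothetical lexicon standing in for the words of the main index.
aux = build_aux_index(["hello", "help", "world", "yellow"])
print(suggest("helo", aux))  # 'hello' ranks first: it shares 'he', 'el' and 'lo'
```

Note that the lookup touches only the grams of the checked word, so
nothing is expanded into permutations of the misspelling.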

The only performance hit is when refreshing/rebuilding the
auxiliary index after the lexicon of the actual index
has changed a lot. But this can be done in the background, at a time
that suits the application using Lucene and the spell checker.
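
One way to keep the rebuild out of the request path is sketched below,
again in Python and not with Lucene's API. It assumes the auxiliary
index fits in memory as a dict and that swapping a single reference is
enough for readers to see either the old or the new index; the class
SuggestionIndex and its methods are hypothetical.

```python
import threading


class SuggestionIndex:
    """Toy 2-gram index that can be rebuilt without blocking lookups."""

    def __init__(self, lexicon):
        self._index = self._build(lexicon)

    @staticmethod
    def _build(lexicon):
        # Map each 2-gram to the set of words that contain it.
        index = {}
        for word in lexicon:
            for i in range(len(word) - 1):
                index.setdefault(word[i:i + 2], set()).add(word)
        return index

    def lookup(self, gram):
        # Readers always go through the current reference.
        return self._index.get(gram, set())

    def rebuild_in_background(self, lexicon):
        def work():
            fresh = self._build(lexicon)  # slow part, done off the caller's thread
            self._index = fresh           # swap in the finished index in one step
        worker = threading.Thread(target=work, daemon=True)
        worker.start()
        return worker


idx = SuggestionIndex(["hello"])
worker = idx.rebuild_in_background(["hello", "help"])
worker.join()  # only for the demo; an application would keep serving lookups
print(sorted(idx.lookup("he")))
```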

Doron

smokey <[EMAIL PROTECTED]> wrote on 03/12/2007 17:23:21:

> My question is for anyone who has experience with Lucene's SpellChecker,
> especially around its performance characteristics/ramifications.
>
> 1. Given the fact that SpellChecker expands a query by adding all the
> permutations of potentially misspelled word, how does it
> perform in general?
>
> 2. How are others handling the case where SpellChecker would NOT perform
> well if you expand the query adding all the permutations? In other words,
> what kind of techniques are people using to get around or alleviate the
> performance hit if any?
>
> Any sharing of information or pointers would be appreciated.

