DirectSpellChecker.suggestSimilar() scans TermEnum. but why?

Mikhail Khludnev Sat, 29 Dec 2012 06:59:44 -0800

Happy New Year, Devs!

Excuse the newbie question. I haven't been able to dig deep into the FST
internals. I ran a trivial benchmark and was not happy with the results.


I'm looking for ultra-fast spelling correction. Right now I use the 3.x
SpellChecker, which is backed by a separate Lucene n-gram index. FWIW, it's
persistent, not in a RAMDirectory. The bottleneck now is I/O: reading that
n-gram index takes too much time. I guess this could be solved by loading
the n-gram index into a RAMDirectory, but I would like to use the FST-based
spell checking from 4.0.

Here is what I see, and what puzzles me. Every call to
DirectSpellChecker.suggestSimilar() creates a new FuzzyTermsEnum, and each
time it scans the termsEnum via FilteredTermsEnum.next(). Here I hit the
same slow I/O problem. One possibly relevant detail: I am reading a 3.x
index with 4.0 code. I don't think that changes anything.

I don't know anything about FSTs, but I thought an FST was a compact graph
of syllables that is traversed to find strings similar to a given one, i.e.
I expected that it would not scan the termsEnum for every lookup.
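
The expectation above can be sketched in plain Java as follows. This is only
an illustration under stated assumptions, not Lucene code and not how
DirectSpellChecker is implemented: the class TrieFuzzySketch and its methods
add() and similar() are hypothetical names, a simple character trie stands in
for the FST, and a row-by-row Levenshtein computation stands in for whatever
automaton Lucene uses. The point is only that a graph walk can skip whole
branches instead of visiting every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: fuzzy lookup by walking a character trie. Not Lucene code. */
final class TrieFuzzySketch {

    /** One trie node; children are kept sorted so results come out in term order. */
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    private final Node root = new Node();

    /** Adds one term to the in-memory trie. */
    void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    /** Returns all stored terms within maxEdits Levenshtein edits of query. */
    List<String> similar(String query, int maxEdits) {
        List<String> out = new ArrayList<>();
        // Edit distances between the empty prefix and every prefix of the query.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        walk(root, new StringBuilder(), firstRow, query, maxEdits, out);
        return out;
    }

    private void walk(Node node, StringBuilder prefix, int[] prevRow,
                      String query, int maxEdits, List<String> out) {
        // The last cell is the distance between the current prefix and the full query.
        if (node.isTerm && prevRow[query.length()] <= maxEdits) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            char c = e.getKey();
            int[] row = new int[prevRow.length];
            row[0] = prevRow[0] + 1;
            int min = row[0];
            for (int j = 1; j < row.length; j++) {
                int cost = query.charAt(j - 1) == c ? 0 : 1;
                row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1),
                                  prevRow[j - 1] + cost);
                min = Math.min(min, row[j]);
            }
            // Prune: if every cell exceeds maxEdits, no term below this node can match.
            if (min <= maxEdits) {
                prefix.append(c);
                walk(e.getValue(), prefix, row, query, maxEdits, out);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

With the terms "lucene", "lucent", "lucid" and "solr" added, a call such as
similar("lucine", 1) never descends below the "s" branch, which is the kind of
behaviour I expected from an FST-backed lookup.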

Please tell me what's wrong with my expectations. Thanks!

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <[email protected]>
