Hi there,

I'm mostly done with supporting the major Hunspell features necessary for most
European languages (https://issues.apache.org/jira/browse/LUCENE-9687), though
of course I anticipate more minor fixes to come. Thanks to Dawid Weiss for the
thorough reviews and for promptly accepting my PRs so far!

Now I'd like to make this Hunspell implementation at least as fast as the
native Hunspell called via JNI, ideally faster. Currently it's 1.5-3 times
slower for me, depending on the language (I've checked en/de/fr so far).
I've profiled it and done some minor optimizations, and now most of the time
is spent in FST traversals. I've prototyped reducing the number of these
traversals, and the execution time goes down noticeably (e.g. by 30%), but
that's still not enough, and the code becomes complicated.
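
For context, each dictionary check is roughly a lookup like the sketch below
(simplified and illustrative only: the real code also strips affixes and checks
case variants, so a single input token can trigger many such traversals):

    import java.io.IOException;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.Util;

    // Illustrative only: each candidate (the word itself plus every de-affixed
    // variant) costs a full root-to-leaf walk of the FST, one arc per character.
    static IntsRef lookupWord(FST<IntsRef> fst, char[] word, int offset, int length)
        throws IOException {
      IntsRefBuilder scratch = new IntsRefBuilder();
      for (int i = 0; i < length; i++) {
        scratch.append(word[offset + i]);
      }
      return Util.get(fst, scratch.get()); // null if the word isn't in the dictionary
    }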

So I'm considering other data structures instead of FSTs (Hunspell/C++
itself doesn't bother with tries: it uses hash tables and linear searches
instead). The problem is that FSTs are very well space-optimized, and other
data structures consume more memory.

So my question is: what's the relative importance of speed and memory in
Lucene's stemmer? E.g. the FST for German currently takes 2.2MB. Would it be
OK to use a CharArrayMap that takes 20-25MB but is much faster on lookup (a
45% improvement in stemming)? Or, with a BytesRefHash plus an array, I can
get it down to ~9MB with almost the same speedup (but more complex code).
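
To make the BytesRefHash variant concrete, here's roughly what I have in mind
(a sketch only; WordMap and forms are made-up names, and the real per-word
payload would be whatever the stemmer actually needs):

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.BytesRefHash;

    // Sketch: BytesRefHash stores the word bytes compactly, and the id it
    // assigns indexes a parallel array holding the per-word data.
    class WordMap {
      private final BytesRefHash words = new BytesRefHash();
      private Object[] forms = new Object[16];

      void put(BytesRef word, Object data) {
        int id = words.add(word);
        if (id < 0) id = -id - 1; // already present
        if (id >= forms.length) {
          forms = java.util.Arrays.copyOf(forms, Math.max(id + 1, forms.length * 2));
        }
        forms[id] = data;
      }

      Object get(BytesRef word) {
        int id = words.find(word);
        return id < 0 ? null : forms[id];
      }
    }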

How much memory usage is acceptable at all?

Maybe there are other suitable data structures in Lucene core that I'm not
aware of? I basically need a Map<String, Object> that could be queried with
char[]+offset+length keys (like CharArrayMap does).
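
For illustration, this is the access pattern I'm after, using CharArrayMap
(the StemLookup wrapper and its initial sizing are hypothetical):

    import org.apache.lucene.analysis.CharArrayMap;

    // Hypothetical sketch: the map is populated once from the dictionary, and
    // lookups during stemming pass a char[] slice, avoiding String allocation.
    class StemLookup {
      private final CharArrayMap<Object> stems = new CharArrayMap<>(1 << 20, false);

      void add(CharSequence word, Object entry) {
        stems.put(word, entry);
      }

      Object lookup(char[] buffer, int offset, int length) {
        return stems.get(buffer, offset, length);
      }
    }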

Peter
