Hi there, I'm mostly done with supporting the major Hunspell features necessary for most European languages (https://issues.apache.org/jira/browse/LUCENE-9687), though of course I anticipate more minor fixes to come. Thanks Dawid Weiss for the thorough reviews and for promptly accepting my PRs so far!
Now I'd like to make this Hunspell implementation at least as fast as the native Hunspell called via JNI, ideally faster. Currently it's 1.5-3 times slower for me, depending on the language (I've checked en/de/fr so far). I've profiled it and done some minor optimizations, and now most of the time is spent in FST traversals. I've prototyped reducing the number of these traversals, and the execution time goes down noticeably (e.g. by 30%), but that's still not enough, and the code becomes complicated.

So I'm considering other data structures instead of FSTs (Hunspell/C++ itself doesn't bother with tries: it uses hash tables and linear searches instead). The problem is that FSTs are very well space-optimized, and other data structures consume more memory. So my question is: what's the relative importance of speed and memory in Lucene's stemmer? For example, the FST for German currently takes 2.2MB. Would it be OK to use a CharArrayMap taking 20-25MB that's much faster on lookup (a 45% improvement in stemming)? Or, with a BytesRefHash plus an array, I can get it down to ~9MB with almost the same speedup (but more complex code). How much memory usage is acceptable at all?

Maybe there are other suitable data structures in Lucene core that I'm not aware of? What I basically need is a Map<String, Object> that would ideally be queried with char[]+offset+length keys (the way CharArrayMap does).

Peter
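To illustrate the kind of lookup I mean, here's a minimal toy sketch (not Lucene's CharArrayMap, just a hypothetical stand-in) of a map that can be queried by char[]+offset+length without allocating a String per lookup, which is where the speedup over per-token String creation comes from. It uses open addressing with linear probing and omits resizing/removal for brevity:

```java
// Toy sketch of a char-slice-keyed map (hypothetical, for illustration only).
// Lookups take char[]+offset+length and allocate nothing, like CharArrayMap.
final class CharSliceMap<V> {
  private final char[][] keys;
  private final Object[] values;
  private final int mask;

  CharSliceMap(int capacityPow2) { // capacity must be a power of two
    keys = new char[capacityPow2][];
    values = new Object[capacityPow2];
    mask = capacityPow2 - 1;
  }

  void put(String key, V value) {
    char[] k = key.toCharArray();
    int slot = hash(k, 0, k.length) & mask;
    while (keys[slot] != null && !equalsSlice(keys[slot], k, 0, k.length)) {
      slot = (slot + 1) & mask; // linear probing
    }
    keys[slot] = k;
    values[slot] = value;
  }

  @SuppressWarnings("unchecked")
  V get(char[] text, int off, int len) {
    int slot = hash(text, off, len) & mask;
    while (keys[slot] != null) {
      if (equalsSlice(keys[slot], text, off, len)) return (V) values[slot];
      slot = (slot + 1) & mask;
    }
    return null;
  }

  private static int hash(char[] a, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) h = 31 * h + a[i]; // String.hashCode-style
    return h;
  }

  private static boolean equalsSlice(char[] key, char[] text, int off, int len) {
    if (key.length != len) return false;
    for (int i = 0; i < len; i++) if (key[i] != text[off + i]) return false;
    return true;
  }
}

public class Demo {
  public static void main(String[] args) {
    CharSliceMap<String> stems = new CharSliceMap<>(1 << 4);
    stems.put("laufen", "lauf");
    // Look up a word in place inside a larger buffer, no substring copy:
    char[] sentence = "wir laufen schnell".toCharArray();
    System.out.println(stems.get(sentence, 4, 6)); // prints "lauf"
  }
}
```

The memory cost in the numbers above comes from storing the expanded char[] keys and entry tables instead of the FST's shared-prefix/suffix encoding; the lookup win comes from a single hash-and-probe instead of one transition per character.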