The RAM usage used to be as bad as you describe, and it blows up far worse for other languages than for German. There were many issues :)
For Lucene, one common issue was that users wanted to keep many of these dictionaries in RAM at once: e.g. supporting many different languages on a single server (multilingual data) and so forth.

Can we speed up your use-case by using the FST in a smarter way? Why are there so many traversals... is it the way it does inexact matching? Decomposition? That was the trick done with stemming: the stemming was accelerated with side data structures, for example the "patternIndex" thing, which is a scary precomputed list of tableized DFAs. It wastes a "little" space on these tables to speed up the hotspot for stemming.

In that patternIndex example, some assumptions/limits had to be set that hopefully no dictionary would ever exceed: those are all the "please report this to [email protected]" checks in the code. Tests were run against all the crazy OO dictionaries out there to examine the memory usage when looking at changes like this. Some of them are really, really crazy and do surprising things.

On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov <[email protected]> wrote:
>
> Hi there,
>
> I'm mostly done with supporting the major Hunspell features necessary for
> most European languages (https://issues.apache.org/jira/browse/LUCENE-9687),
> though of course I anticipate more minor fixes to come. Thanks Dawid Weiss
> for the thorough reviews and for promptly accepting my PRs so far!
>
> Now I'd like to make this Hunspell implementation at least as fast as the
> native Hunspell called via JNI, ideally faster. Currently it's 1.5-3 times
> slower for me, depending on the language (I've checked en/de/fr so far).
> I've profiled it, done some minor optimizations, and now it appears that
> most of the time is spent in FST traversals. I've prototyped decreasing
> the number of these traversals, and the execution time goes down
> noticeably (e.g. 30%), but that's still not enough, and the code becomes
> complicated.
> So I'm considering other data structures instead of FSTs (Hunspell/C++
> itself doesn't bother with tries: it uses hash tables and linear searches
> instead). The problem is that the FST is very well space-optimized, and
> other data structures consume more memory.
>
> So my question is: what's the relative importance of speed and memory in
> Lucene's stemmer? For example, the FST for German currently takes 2.2MB.
> Would it be OK to use a CharArrayMap taking 20-25MB that is much faster
> on lookup (a 45% improvement in stemming)? Or, with a BytesRefHash plus
> an array, I can get it to ~9MB with almost the same speedup (but more
> complex code).
>
> How much memory usage is acceptable at all?
>
> Maybe there are other suitable data structures in Lucene core that I'm
> not aware of? I basically need a Map<String, Object> that would be
> better queried with char[]+offset+length keys (like CharArrayMap does).
>
> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
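To illustrate the "tableized" idea behind that patternIndex trick: a Hunspell affix condition like "[^aeiou]y" can be precompiled into per-position lookup tables over a small alphabet, so the hot check is plain array indexing instead of regex or character-class matching per candidate. This is a hedged, illustrative sketch, not Lucene's actual code: the class name ConditionTable, the ASCII-only alphabet, and the simplified pattern syntax (literals and [..]/[^..] classes only) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: precompile an affix condition into boolean tables,
// one table per condition position, over ASCII characters only.
class ConditionTable {
  private final boolean[][] allowed; // allowed[pos][c]

  ConditionTable(String condition) {
    List<boolean[]> tables = new ArrayList<>();
    int i = 0;
    while (i < condition.length()) {
      boolean[] t = new boolean[128];
      char c = condition.charAt(i);
      if (c == '[') {
        int close = condition.indexOf(']', i);
        boolean negate = condition.charAt(i + 1) == '^';
        int start = negate ? i + 2 : i + 1;
        if (negate) {
          Arrays.fill(t, true); // everything allowed except listed chars
        }
        for (int j = start; j < close; j++) {
          t[condition.charAt(j)] = !negate;
        }
        i = close + 1;
      } else {
        t[c] = true; // literal character
        i++;
      }
      tables.add(t);
    }
    allowed = tables.toArray(new boolean[0][]);
  }

  // Check whether the last allowed.length chars of word[0..len) match
  // the condition; the hot loop is just array lookups.
  boolean matchesEnd(char[] word, int len) {
    int n = allowed.length;
    if (len < n) return false;
    for (int p = 0; p < n; p++) {
      char c = word[len - n + p];
      if (c >= 128 || !allowed[p][c]) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    ConditionTable ct = new ConditionTable("[^aeiou]y");
    System.out.println(ct.matchesEnd("happy".toCharArray(), 5)); // true
    System.out.println(ct.matchesEnd("play".toCharArray(), 4));  // false
  }
}
```

The space/time trade is the one described above: a few hundred bytes of tables per condition, paid once at dictionary load, in exchange for branch-free checks in the stemming hotspot.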
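The char[]+offset+length lookup Peter mentions can be sketched as follows: hashing and comparing a region of a larger buffer directly, so no String (and no char[] copy) is allocated on the lookup path. This is a minimal illustration of the idea, assuming fixed-capacity open addressing; it is not the implementation of Lucene's CharArrayMap, and the class name CharRegionMap is made up.

```java
// Sketch of a map queried by char[]+offset+length without allocating
// a key object per lookup. Fixed power-of-two capacity, linear probing;
// assumes the load stays below capacity (no resize, for brevity).
class CharRegionMap<V> {
  private final char[][] keys;
  private final Object[] values;
  private final int mask;

  CharRegionMap(int capacityPow2) {
    keys = new char[capacityPow2][];
    values = new Object[capacityPow2];
    mask = capacityPow2 - 1;
  }

  private static int hash(char[] text, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) {
      h = 31 * h + text[i];
    }
    return h;
  }

  private static boolean regionEquals(char[] stored, char[] text, int off, int len) {
    if (stored.length != len) return false;
    for (int i = 0; i < len; i++) {
      if (stored[i] != text[off + i]) return false;
    }
    return true;
  }

  void put(String key, V value) {
    char[] k = key.toCharArray();
    int slot = hash(k, 0, k.length) & mask;
    while (keys[slot] != null && !regionEquals(keys[slot], k, 0, k.length)) {
      slot = (slot + 1) & mask; // linear probing
    }
    keys[slot] = k;
    values[slot] = value;
  }

  // The hot path: hash and compare the region in place, no allocation.
  @SuppressWarnings("unchecked")
  V get(char[] text, int off, int len) {
    int slot = hash(text, off, len) & mask;
    while (keys[slot] != null) {
      if (regionEquals(keys[slot], text, off, len)) {
        return (V) values[slot];
      }
      slot = (slot + 1) & mask;
    }
    return null;
  }

  public static void main(String[] args) {
    CharRegionMap<String> m = new CharRegionMap<>(16);
    m.put("haus", "stem:haus");
    char[] buffer = "dashaus".toCharArray();
    System.out.println(m.get(buffer, 3, 4)); // stem:haus
  }
}
```

The memory cost Peter quotes (20-25MB for German vs. 2.2MB as an FST) comes from storing each key as its own char[] plus table overhead, which is exactly the speed-for-space trade the question is about.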
