> They just seem to need reading/analyzing too many bytes, doing much more
work than a typical hashmap access :)

This is a very tough score to beat... Pretty much any trie structure will
have to descend somehow. FSTs are additionally densely packed in Lucene and
outgoing arc lookup is what's causing the slowdown. An alternative
memory-conservative FST representation is a hash table of [fst-node_id,
out-arc-label] -> fst-node_id. A structure like this can be made
reasonably compact, and path traversals (lookups) are constant-time... at
the price of not being able to enumerate all of a node's outgoing arcs. I've
used this approach in the past and it gave a reasonable speedup, although
it's a really specialized data structure.
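
Something like this, to sketch the shape of it (all names invented for
illustration; a real implementation would use a primitive long -> int map
instead of a boxed HashMap to stay compact):

    import java.util.HashMap;
    import java.util.Map;

    class TransitionTable {
      private final Map<Long, Integer> transitions = new HashMap<>();

      private static long key(int nodeId, int label) {
        return ((long) nodeId << 32) | (label & 0xFFFFFFFFL);
      }

      void addArc(int fromNode, int label, int toNode) {
        transitions.put(key(fromNode, label), toNode);
      }

      /** Follows one arc; -1 if the node has no arc with this label. */
      int step(int nodeId, int label) {
        Integer target = transitions.get(key(nodeId, label));
        return target == null ? -1 : target;
      }

      /** Constant time per character, but cannot enumerate a node's arcs. */
      int lookup(int rootNode, CharSequence word) {
        int node = rootNode;
        for (int i = 0; i < word.length() && node >= 0; i++) {
          node = step(node, word.charAt(i));
        }
        return node;
      }
    }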

D.

On Wed, Feb 10, 2021 at 3:43 PM Peter Gromov
<[email protected]> wrote:

> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the
> memory significantly. Do you have any idea about how many of them people
> are loading, and how much memory they give to Lucene?
>
> Yes, I've mentioned I've prototyped "using FST in a smarter way" :)
> Namely, it's possible to cache the arcs/outputs used for searching for
> "electrification" and reuse most of them after an affix is stripped and
> we're now faced with "electrify". This allocates a bit more for each token,
> but gives a noticeable speedup. I'm not entirely happy with the resulting
> code complexity and performance, but I can create a PR.
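>
> The shape of that prototype is roughly as follows, just for illustration
> (not the actual code; "State" and step() stand in for an FST arc plus its
> accumulated output and for a single findTargetArc call):
>
>     abstract class CachingStemLookup {
>       private char[] cachedWord;     // word whose traversal is currently cached
>       private State[] cachedStates;  // cachedStates[i] = state after cachedWord[0..i)
>
>       /** State after the whole word, or null if it falls off the FST. */
>       State traverse(char[] word, int len, State root) {
>         int reuse = 0;  // length of the prefix shared with the previous word
>         if (cachedWord != null) {
>           int max = Math.min(len, cachedWord.length);
>           while (reuse < max && word[reuse] == cachedWord[reuse]) reuse++;
>         }
>         State[] states = new State[len + 1];
>         states[0] = root;
>         if (reuse > 0) {
>           System.arraycopy(cachedStates, 1, states, 1, reuse);  // reuse cached arcs
>         }
>         for (int i = reuse; i < len && states[i] != null; i++) {
>           states[i + 1] = step(states[i], word[i]);  // one FST arc lookup
>         }
>         cachedWord = java.util.Arrays.copyOf(word, len);
>         cachedStates = states;
>         return states[len];
>       }
>
>       /** One arc lookup + output accumulation; null if there is no such arc. */
>       abstract State step(State from, char label);
>
>       interface State {}
>     }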
>
> I'm talking only about plain old affix removal. I have no inexact
> matching. Decomposition basically works like "try to break the word in
> various places and stem them separately, looking at some additional flags".
> For the first word part, some arc/outputs could be reused from initial
> analysis, but for the next ones most likely not. And when I tried the
> aforementioned reusing, the code became so unpleasant that I started
> looking for alternatives :)
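>
> To give a sense of the shape, the "break the word in various places" part
> is roughly the following (hypothetical helper name; details like minimal
> part length and the exact flag checks are omitted):
>
>     /** True if the word splits into 2+ parts that each stem with compound flags. */
>     boolean splitsIntoCompounds(String word, int from) {
>       for (int breakAt = from + 1; breakAt < word.length(); breakAt++) {
>         if (hasStemAllowingCompounds(word.substring(from, breakAt))) {  // stem the head
>           String rest = word.substring(breakAt);
>           if (hasStemAllowingCompounds(rest) || splitsIntoCompounds(word, breakAt)) {
>             return true;  // the rest is either a final part or splits further
>           }
>         }
>       }
>       return false;
>     }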
>
> One thing I don't like about the arc caching approach is that it looks
> like a dead end: the FST invocation count seems to be already close to
> minimal, and yet its traversal is still very visible in the CPU snapshots.
> And I see no low-hanging fruits in FST internals. They just seem to need
> reading/analyzing too many bytes, doing much more work than a typical
> hashmap access :)
>
> Thanks for the idea about root arcs. I've done some quick sampling and
> tracing (for German). 80% of root arc processing time is spent in direct
> addressing, and the remainder is linear scan (so root arcs don't seem to
> present major issues). For non-root arcs, ~50% is directly addressed, ~45%
> linearly-scanned, and the remainder binary-searched. Overall there's about
> 60% of direct addressing, both in time and invocation counts, which doesn't
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
> that might increase the number of directly addressed arcs, but I'm not sure
> that'd speed up much given that time and invocation counts seem to
> correlate.
>
> Peter
>
> On Wed, Feb 10, 2021 at 2:52 PM Robert Muir <[email protected]> wrote:
>
>> Just throwing out another random idea: if you are doing a lot of FST
>> traversals (e.g. for inexact matching or decomposition), you may end
>> up "hammering" the root arcs of the FST heavily, depending on how the
>> algorithm works. Because root arcs are "busy", they end up being
>> O(logN) lookups in the FST and get slow. Japanese and Korean analyzers
>> are doing "decompounding" too, and have hacks to waste some RAM,
>> ensuring the heavy root arc traversals are O(1):
>>
>> https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
>>
>> Bruno did some FST improvements across the board here, but last time
>> we checked, these hacks were still needed for segmentation use cases
>> like this; see his benchmark here: https://s.apache.org/ffelc
>>
>> For example, maybe it makes sense to cache a few hundred nodes here in
>> a similar way, depending on the dictionary's alphabet size, to accelerate
>> segmentation; I don't know if it will help. Maybe also the current FST
>> "INPUT_TYPEs" are inappropriate and it would work better as BYTE1 FST
>> rather than BYTE2 or BYTE4 or whatever it is using now. The current
>> stemming doesn't put much pressure on this, so it isn't optimized.
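>>
>> Transplanted to Hunspell, that hack could look roughly like the sketch
>> below (simplified, along the lines of TokenInfoFST; the cached label
>> range and the Long output type are placeholders and would depend on the
>> dictionary):
>>
>>     import java.io.IOException;
>>     import org.apache.lucene.util.fst.FST;
>>
>>     class RootArcCache {
>>       private final FST<Long> fst;
>>       private final FST.Arc<Long> firstArc;
>>       private final FST.Arc<Long>[] rootCache;  // one slot per cached label
>>       private final int cacheFirst, cacheLast;
>>
>>       RootArcCache(FST<Long> fst, int cacheFirst, int cacheLast) throws IOException {
>>         this.fst = fst;
>>         this.cacheFirst = cacheFirst;
>>         this.cacheLast = cacheLast;
>>         this.firstArc = fst.getFirstArc(new FST.Arc<>());
>>         @SuppressWarnings({"unchecked", "rawtypes"})
>>         FST.Arc<Long>[] cache = new FST.Arc[cacheLast - cacheFirst + 1];
>>         rootCache = cache;
>>         FST.Arc<Long> scratch = new FST.Arc<>();
>>         FST.BytesReader reader = fst.getBytesReader();
>>         for (int label = cacheFirst; label <= cacheLast; label++) {
>>           if (fst.findTargetArc(label, firstArc, scratch, reader) != null) {
>>             rootCache[label - cacheFirst] = new FST.Arc<Long>().copyFrom(scratch);
>>           }
>>         }
>>       }
>>
>>       /** O(1) array access for cached labels, normal FST lookup otherwise. */
>>       FST.Arc<Long> findRootArc(int label, FST.Arc<Long> arc, FST.BytesReader in)
>>           throws IOException {
>>         if (label >= cacheFirst && label <= cacheLast) {
>>           FST.Arc<Long> cached = rootCache[label - cacheFirst];
>>           return cached == null ? null : arc.copyFrom(cached);
>>         }
>>         return fst.findTargetArc(label, firstArc, arc, in);
>>       }
>>     }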
>>
>> On Wed, Feb 10, 2021 at 7:53 AM Robert Muir <[email protected]> wrote:
>> >
>> > The RAM usage used to be bad as you describe; it blows up way worse
>> > for other languages than German. There were many issues :)
>> >
>> > For Lucene, one common issue was that users wanted to have a lot of
>> > these things in RAM: e.g. supporting many different languages on a
>> > single server (multilingual data) and so forth.
>> > Can we speed up your use-case by using the FST in a smarter way? Why
>> > are there so many traversals... is it the way it is doing inexact
>> > matching? decomposition?
>> >
>> > That was the trick done with stemming, and the stemming was
>> > accelerated with some side data structures. For example the "patternIndex"
>> > thing, which is a scary precomputed list of tableized DFAs... it's
>> > wasting a "little" space with these tables to speed up the hotspot for
>> > stemming. In that patternIndex example, some assumptions/limits had to
>> > be set that hopefully no dictionary would ever violate: that's all
>> > the "please report this to [email protected]" checks in the code.
>> > Some tests were run against all the crazy OO dictionaries out there to
>> > examine the memory usage when looking at changes like this. Some of
>> > these are really, really crazy and do surprising things.
>> >
>> > On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
>> > <[email protected]> wrote:
>> > >
>> > > Hi there,
>> > >
>> > > I'm mostly done with supporting major Hunspell features necessary for
>> most European languages (
>> https://issues.apache.org/jira/browse/LUCENE-9687) (but of course I
>> anticipate more minor fixes to come). Thanks to Dawid Weiss for the thorough
>> reviews and for promptly accepting my PRs so far!
>> > >
>> > > Now I'd like to make this Hunspell implementation at least as fast as
>> the native Hunspell called via JNI, ideally faster. Currently it seems 1.5-3
>> times slower for me, depending on the language (I've checked en/de/fr so
>> far). I've profiled it, done some minor optimizations, and now it appears
>> that most time is taken by FST traversals. I've prototyped decreasing the
>> number of these traversals, and the execution time goes down noticeably
>> (e.g. 30%), but it's still not enough, and the code becomes complicated.
>> > >
>> > > So I'm considering other data structures instead of FSTs
>> (Hunspell/C++ itself doesn't bother with tries: it uses hash tables and
>> linear searches instead). The problem is, FST is very well space-optimized,
>> and other data structures consume more memory.
>> > >
>> > > So my question is: what's the relative importance of speed and memory
>> in Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be
>> OK to use a CharArrayMap taking 20-25MB, but be much faster on lookup (45%
>> improvement in stemming)? Or, with a BytesRefHash plus an array I can make
>> it ~9MB, with almost the same speedup (but more complex code).
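>> > >
>> > > For the record, that BytesRefHash+array variant is roughly the
>> > > following shape (illustrative only; the entry type is a placeholder
>> > > and keys are the UTF-8 bytes of the word):
>> > >
>> > >     import java.util.ArrayList;
>> > >     import java.util.List;
>> > >     import org.apache.lucene.util.BytesRef;
>> > >     import org.apache.lucene.util.BytesRefHash;
>> > >
>> > >     class WordMap {
>> > >       private final BytesRefHash words = new BytesRefHash();
>> > >       private final List<Object> entries = new ArrayList<>();
>> > >
>> > >       void put(BytesRef word, Object entry) {
>> > >         int id = words.add(word);       // new key: its dense id, counted up from 0
>> > >         if (id >= 0) {
>> > >           entries.add(entry);           // ids are sequential, so add() stays aligned
>> > >         } else {
>> > >           entries.set(-id - 1, entry);  // key already present: -(id)-1
>> > >         }
>> > >       }
>> > >
>> > >       Object get(BytesRef word) {
>> > >         int id = words.find(word);      // hash lookup; -1 if absent
>> > >         return id < 0 ? null : entries.get(id);
>> > >       }
>> > >     }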
>> > >
>> > > How much memory usage is acceptable at all?
>> > >
>> > > Maybe there are other suitable data structures in Lucene core that
>> I'm not aware of? I basically need a Map<String, Object>, which'd be better
>> queried with char[]+offset+length keys (like CharArrayMap does).
>> > >
>> > > Peter
>>
