+1 to configurability that is well documented and reasonably actionable
downstream in Solr... Some folks struggle with the costs of buying machines
with lots of memory.

On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss <[email protected]> wrote:

>
>
>> To me the challenge with such a change is just trying to prevent
>> strange dictionaries from blowing up to 30x the space :)
>>
>
> Maybe the "backend" could be configurable somehow so that you could change
> the strategy depending on your needs?... I haven't looked at how FSTs are
> used but if can be hidden behind a facade then an alternative
> implementation could be provided depending on one's need?
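>
> Something along these lines, maybe (just a rough sketch with made-up
> names, not an existing Lucene API):
>
>     // Hypothetical facade over the dictionary's root-word lookup, so the
>     // storage strategy can be swapped without touching the callers.
>     interface WordStorage {
>       // Returns the raw flag data for a root word, or null if absent.
>       char[] lookup(String root);
>     }
>
>     // One possible strategy: fast but memory-hungry, for those who can
>     // afford it.
>     final class HashMapWordStorage implements WordStorage {
>       private final java.util.Map<String, char[]> words;
>
>       HashMapWordStorage(java.util.Map<String, char[]> words) {
>         this.words = words;
>       }
>
>       @Override
>       public char[] lookup(String root) {
>         return words.get(root);
>       }
>     }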
>
> D.
>
>
>>
>> On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
>> <[email protected]> wrote:
>> >
>> > I was hoping for some numbers :) In the meantime, I've got some of my
>> > own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
>> > (there are more, but I ignored dialects of the same base language).
>> > Together they currently consume a humble 166MB. With one of my less
>> > memory-hungry approaches, they'd take ~500MB (maybe less if I optimize,
>> > but probably not significantly). Is this very bad, or tolerable for,
>> > say, a 50% speedup?
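>> >
>> > (For reference, each dictionary is loaded roughly like this with the
>> > Lucene 8.x API, if I remember the signature right; Dictionary wants a
>> > temp Directory because it sorts the .dic contents offline while
>> > building. The de_DE.* file names are just examples:)
>> >
>> >     // org.apache.lucene.analysis.hunspell.Dictionary,
>> >     // org.apache.lucene.store.FSDirectory
>> >     Directory tempDir = FSDirectory.open(Files.createTempDirectory("hunspell"));
>> >     try (InputStream affix = Files.newInputStream(Paths.get("de_DE.aff"));
>> >          InputStream words = Files.newInputStream(Paths.get("de_DE.dic"))) {
>> >       Dictionary dictionary = new Dictionary(tempDir, "hunspell", affix, words);
>> >     }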
>> >
>> > I've seen huge *.aff files, and I'm planning to do something with affix
>> > FSTs, too. They also take noticeable time, but much less than the *.dic
>> > ones, so for now I'm concentrating on *.dic.
>> >
>> > > Sure, but 20% of those linear scans are maybe 7x slower
>> >
>> > I checked that. The distribution appears to decrease monotonically:
>> > no linear scan is longer than 8, and ~85% of all linear scans end after
>> > no more than one miss.
>> >
>> > I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
>> > complicated :(
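>> >
>> > (The mechanical part of the switch, for reference, against the Lucene
>> > 8.x FST API as I recall it: with BYTE1 the arcs are labeled with UTF-8
>> > bytes instead of whole code points, which shrinks the label alphabet.
>> > The "wort"/17L input is an arbitrary example:)
>> >
>> >     // Classes from org.apache.lucene.util.fst; inputs must be added
>> >     // in sorted order in both cases.
>> >     PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
>> >     IntsRefBuilder scratch = new IntsRefBuilder();
>> >
>> >     // Current shape: one arc per code point.
>> >     FSTCompiler<Long> byte4 = new FSTCompiler<>(FST.INPUT_TYPE.BYTE4, outputs);
>> >     byte4.add(Util.toUTF32("wort", scratch), 17L);
>> >
>> >     // Candidate shape: one arc per UTF-8 byte.
>> >     FSTCompiler<Long> byte1 = new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, outputs);
>> >     byte1.add(Util.toIntsRef(new BytesRef("wort"), scratch), 17L);
>> >     FST<Long> fst = byte1.compile();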
>> >
>> > On Wed, Feb 10, 2021 at 5:04 PM Robert Muir <[email protected]> wrote:
>> >>
>> >> Peter, looks like you are way ahead of me :) Thanks for all the work
>> >> you have been doing here, and thanks to Dawid for helping!
>> >>
>> >> You probably know a lot of this code better than I do at this point,
>> >> but I remember a couple of the pain points, inline below:
>> >>
>> >> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>> >> <[email protected]> wrote:
>> >> >
>> >> > Hi Robert,
>> >> >
>> >> > Yes, having multiple dictionaries in the same process would increase
>> >> > memory use significantly. Do you have any idea how many of them
>> >> > people are loading, and how much memory they give to Lucene?
>> >>
>> >> Yeah, in many cases the user is running a server such as Solr or
>> >> Elasticsearch. Let's use Solr as an example, as others are here to
>> >> correct me if I am wrong.
>> >>
>> >> An example to illustrate the challenge: the user uses one of Solr's
>> >> three mechanisms to detect language and route documents to different
>> >> pipelines:
>> >> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>> >> Now, we know these language detectors are imperfect: if the user maps a
>> >> lot of languages to hunspell pipelines, they may load lots of
>> >> dictionaries, even from just one stray miscategorized document.
>> >> So it doesn't have to be some extreme "enterprise" use case like
>> >> wikipedia.org; it can happen to a little guy faced with a
>> >> multilingual corpus.
>> >>
>> >> Imagine the user decides to go further and host Solr search in this
>> >> way for a couple of local businesses or government agencies.
>> >> They support many languages and possibly use the detection scheme
>> >> above to try to make language a "non-issue".
>> >> The user may assign each customer a Solr "core" (separate index) with
>> >> this configuration.
>> >> Does each Solr core load its own HunspellStemFilterFactory? I think it
>> >> might (in an isolated classloader), but I could be wrong.
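>> >>
>> >> (The per-core schema config that pulls a dictionary in looks like
>> >> this, per the Solr ref guide; the file names are examples:)
>> >>
>> >>   <filter class="solr.HunspellStemFilterFactory"
>> >>           dictionary="de_DE.dic" affix="de_DE.aff" ignoreCase="true"/>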
>> >>
>> >> For the Elasticsearch case, maybe the resource usage in the same
>> >> scenario is lower, because they reuse dictionaries per node?
>> >> I think this is how it works, but I honestly can't remember.
>> >> Still, the problem remains: it's easy to end up with dozens of these
>> >> things in memory.
>> >>
>> >> Also, we have the problem that memory usage for a specific language
>> >> can blow up in several ways.
>> >> Some languages have a bigger .aff file than .dic!
>> >>
>> >> > Thanks for the idea about root arcs. I've done some quick sampling
>> >> > and tracing (for German). 80% of root arc processing time is spent in
>> >> > direct addressing, and the remainder is linear scan (so root arcs
>> >> > don't seem to present major issues). For non-root arcs, ~50% is
>> >> > directly addressed, ~45% linearly scanned, and the remainder
>> >> > binary-searched. Overall, direct addressing accounts for about 60%,
>> >> > both in time and in invocation counts, which doesn't seem too bad (or
>> >> > am I mistaken?). Currently BYTE4 inputs are used. Reducing that might
>> >> > increase the number of directly addressed arcs, but I'm not sure
>> >> > that'd speed things up much, given that time and invocation counts
>> >> > seem to correlate.
>> >> >
>> >>
>> >> Sure, but 20% of those linear scans are maybe 7x slower; it's
>> >> O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
>> >> Hard to reason about, but maybe worth testing out. It still helps
>> >> all the other segmenters (Japanese, Korean) using FSTs.
>> >>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
