> To me the challenge with such a change is just trying to prevent
> strange dictionaries from blowing up to 30x the space :)

I checked that, and such dictionaries do exist: some grow 20x or even 30x,
at which point they take up to 30MB. Not nice.

> if you assume zipfian distribution of words, the top common ones could be
> stored/ cached outside of the fst (even in an associative dictionary). This
> would require external frequency information during construction but this
> isn't something difficult.

Storing the shortest (easy) or the most frequent (hard) words separately is
an interesting idea. Unfortunately the distribution isn't entirely zipfian,
as the dictionaries tend to contain a lot of short but uncommon words, like
abbreviations. Still, it's something for me to explore, thanks!
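
To make the idea concrete, here is a minimal sketch of how I read it, not
the actual hunspell code. The names WordLookup and HotWordCache are
hypothetical, entries are plain Strings instead of whatever the real
dictionary stores, and the list of hot words is assumed to come from
external frequency data at construction time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical facade: a single lookup entry point over some dictionary backend. */
interface WordLookup {
  /** Returns the entry stored for the word, or null if the word is unknown. */
  String lookup(String word);
}

/** Serves a fixed set of "hot" words from a HashMap and delegates everything else. */
final class HotWordCache implements WordLookup {
  private final Map<String, String> hot = new HashMap<>();
  private final WordLookup backend;

  /** hotWords is assumed to be chosen externally, e.g. by frequency or by length. */
  HotWordCache(WordLookup backend, List<String> hotWords) {
    this.backend = backend;
    for (String word : hotWords) {
      String entry = backend.lookup(word);
      if (entry != null) {
        hot.put(word, entry); // copy the entry out of the slow backend once
      }
    }
  }

  @Override
  public String lookup(String word) {
    String cached = hot.get(word);
    // Misses (including unknown words) still go to the backend.
    return cached != null ? cached : backend.lookup(word);
  }
}

final class HotWordCacheDemo {
  public static void main(String[] args) {
    // A plain map stands in for the FST here; the counter tracks backend hits.
    Map<String, String> dict = Map.of("the", "DET", "nato", "ABBR");
    int[] backendCalls = {0};
    WordLookup slow = w -> {
      backendCalls[0]++;
      return dict.get(w);
    };
    WordLookup fast = new HotWordCache(slow, List.of("the"));
    System.out.println(fast.lookup("the") + " " + fast.lookup("nato"));
    System.out.println("backend calls: " + backendCalls[0]);
  }
}
```

The memory cost is one map entry per hot word, so the gain depends entirely
on how well the hot list matches the words actually looked up, which is
where the short-but-uncommon abbreviations hurt.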

As for configurability, I'm considering that option as well (but would
prefer to avoid it). It's not as easy as adding a facade around one
"lookup" method. Well, right now it is, but if we're staying with the FST,
I'd rather finish the arc caching optimization I described (some 20-30%
speedup, not bad), and that would require changing multiple signatures to
pass some arc cache info around in addition to the simple char[].

On Thu, Feb 11, 2021 at 9:09 AM Dawid Weiss <[email protected]> wrote:

>
> I peeked at the code and I still think it's not a bad idea to experiment
> with extracting a facade for construction and lookup of words. there may
> even be a middle ground between size and speed - if you assume zipfian
> distribution of words, the top common ones could be stored/ cached outside
> of the fst (even in an associative dictionary). This would require external
> frequency information during construction but this isn't something
> difficult.
>
> D.
>
> On Thu, Feb 11, 2021 at 8:54 AM Dawid Weiss <[email protected]> wrote:
>
>>
>> I didn't mean for Peter to write both backends but perhaps, if he's
>> experimenting already anyway, make it possible to extract an interface
>> which could be substituted externally with different implementations. Makes
>> it easier to tinker with various options, even for us.
>>
>> D.
>>
>> On Thu, Feb 11, 2021 at 1:16 AM Robert Muir <[email protected]> wrote:
>>
>>> On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss <[email protected]>
>>> wrote:
>>> > Maybe the "backend" could be configurable somehow so that you could
>>> > change the strategy depending on your needs?... I haven't looked at how
>>> > FSTs are used but if can be hidden behind a facade then an alternative
>>> > implementation could be provided depending on one's need?
>>> >
>>> > D.
>>> >
>>>
>>> I don't have any confidence that solr would default to the "smaller"
>>> option or fix how they manage different solr cores or thousands of
>>> threads or any of the analyzer issues. And who would maintain this
>>> separate hunspell backend? I don't think it is fair to Peter to have
>>> to cope with 2 implementations of hunspell, 1 is certainly enough...
>>> :). It's all apache license, at the end of the day if someone wants to
>>> step up, let 'em. otherwise let's get out of their way.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
