Adrien Grand commented on LUCENE-8920:

bq. If we set default oversizing factor 1, we will effectively reduce the FST 
size, but we will have less improvement on perf.

Out of curiosity, have you confirmed this? My intuition is that this new 
encoding is going to be used anyway on dense nodes because of how it tends to 
be more space-efficient on dense nodes. I worry that values greater than 1 
might mostly make this new encoding used on nodes that don't have that many 
arcs, where using binary search or direct addressing doesn't matter much, so it 
wouldn't be a great use of the memory overhead?
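
To make the concern concrete, here is a minimal sketch of how an oversizing factor could gate the choice between the two encodings. The class name, method names, and size model are hypothetical and are not Lucene's actual code. The sketch assumes only that binary search stores one fixed-size slot per arc, that direct addressing stores one slot per label in the node's label range, and it ignores node headers.

```java
final class OversizingSketch {
    // Hypothetical size model: binary search stores one fixed-size slot per arc.
    static long binarySearchBytes(int numArcs, int bytesPerArc) {
        return (long) numArcs * bytesPerArc;
    }

    // Hypothetical size model: direct addressing stores one slot per label
    // in the node's label range, including the gaps.
    static long directAddressingBytes(int labelRange, int bytesPerArc) {
        return (long) labelRange * bytesPerArc;
    }

    // Direct addressing is chosen when its size stays within
    // oversizingFactor times the binary-search size.
    static boolean useDirectAddressing(
            int numArcs, int labelRange, int bytesPerArc, double oversizingFactor) {
        return directAddressingBytes(labelRange, bytesPerArc)
                <= binarySearchBytes(numArcs, bytesPerArc) * oversizingFactor;
    }

    public static void main(String[] args) {
        // Dense node: 200 arcs covering 200 labels is selected even with factor 1.
        System.out.println(useDirectAddressing(200, 200, 4, 1.0)); // true
        // Small sparse node: 5 arcs spread over 10 labels is rejected with factor 1 ...
        System.out.println(useDirectAddressing(5, 10, 4, 1.0)); // false
        // ... and only becomes eligible once the factor is raised to 2.
        System.out.println(useDirectAddressing(5, 10, 4, 2.0)); // true
    }
}
```

Under this simplified model, raising the factor above 1 only changes the outcome for nodes with gaps, such as the 5-arc node above, where binary search is already cheap.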

bq. I tried to get more visible improvement by accepting some memory increase. 
It seems what you're trying to achieve is same FST memory with some perf 

I do indeed care about memory usage because I know of several users who already 
have gigabytes of memory spent on the terms index per node, so even something 
like a 10% increase could translate to hundreds of megabytes. There is 
definitely a speed/memory trade-off here and I could see value in spending a 
bit more memory on the terms index to achieve greater speed, but if you want to 
speed up access to the terms dictionary, it's unclear to me whether increasing 
this oversizing ratio is a better way to spend memory than e.g. decreasing the 
min/max block sizes in BlockTreeTermsWriter in order to be able to more often 
figure out that a term doesn't exist without going to disk.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
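
The label -> id -> arc offset idea quoted above can be sketched as follows. This is an illustration under stated assumptions, not the encoding Lucene uses: the class name, the `short[]` lookup table with -1 marking a gap, and the byte-count helpers are all hypothetical.

```java
import java.util.Arrays;

final class DenseIdLookupSketch {
    private final int minLabel;
    private final short[] labelToId; // one small entry per label in range, -1 marks a gap
    private final long[] arcOffsets; // one entry per arc that actually exists

    DenseIdLookupSketch(int[] sortedLabels, long[] arcOffsets) {
        this.minLabel = sortedLabels[0];
        int range = sortedLabels[sortedLabels.length - 1] - minLabel + 1;
        this.labelToId = new short[range];
        Arrays.fill(labelToId, (short) -1);
        for (int id = 0; id < sortedLabels.length; id++) {
            labelToId[sortedLabels[id] - minLabel] = (short) id;
        }
        this.arcOffsets = arcOffsets;
    }

    // Two-step lookup: label -> dense id, then dense id -> arc offset.
    // Returns -1 when the node has no arc for this label.
    long lookup(int label) {
        int idx = label - minLabel;
        if (idx < 0 || idx >= labelToId.length) {
            return -1;
        }
        short id = labelToId[idx];
        return id < 0 ? -1 : arcOffsets[id];
    }

    // Hypothetical size of array-with-gaps: every label in range pays for full arc metadata.
    static long arrayWithGapsBytes(int labelRange, int bytesPerArc) {
        return (long) labelRange * bytesPerArc;
    }

    // Hypothetical size with indirection: 2 bytes per label in range plus metadata for real arcs only.
    static long indirectionBytes(int labelRange, int numArcs, int bytesPerArc) {
        return (long) labelRange * 2 + (long) numArcs * bytesPerArc;
    }

    public static void main(String[] args) {
        DenseIdLookupSketch node =
                new DenseIdLookupSketch(new int[] {97, 101, 122}, new long[] {10, 25, 40});
        System.out.println(node.lookup(101)); // 25
        System.out.println(node.lookup(98)); // -1 (gap)
        // 3 arcs over a range of 26 labels with 15 bytes of metadata per arc:
        System.out.println(arrayWithGapsBytes(26, 15)); // 390
        System.out.println(indirectionBytes(26, 3, 15)); // 97
    }
}
```

With large per-arc metadata, a gap costs only the size of one lookup entry instead of a full arc slot, which is where the space saving in this model comes from.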

This message was sent by Atlassian Jira
