[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971138#comment-16971138 ]

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

I added the expansion credit to PR #980. This indeed gives more control over 
the oversizing factor, and the result is better than I anticipated.

All my previous measurements were for the worst case for direct addressing. 
This time I also measured the FST size when building from an English dictionary 
of 479K words (including alphanumeric terms).

I used a direct-addressing oversizing factor of 1 (which ensures no oversizing 
on average).

For the English words I measured 217K nodes; only 3.27% of the nodes (the top 
nodes in the FST) are encoded with fixed-length arcs, and 99.99% of those use 
direct addressing. Overall FST memory is *reduced* by 1.67%.

For the worst case I measured 168K nodes; 50% of them are encoded with 
fixed-length arcs, and 14% of those use direct addressing. Overall FST memory 
is *reduced* by 0.8%.

 

I’m confident that with this last PR we will not increase the FST memory 
(compared to without direct addressing). At the same time, the top nodes with 
many arcs will most often be encoded with direct addressing, and the worst case 
is controlled so that memory never increases while direct addressing is still 
used whenever it is favorable.
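
For readers who have not followed PR #980, here is a minimal sketch of how an 
expansion-credit heuristic of this kind could work. The class, the method 
names, and the byte-size model below are hypothetical simplifications for 
illustration, not the actual Lucene/PR code.

{code:java}
/**
 * Hypothetical sketch of an expansion-credit heuristic: nodes for which direct
 * addressing is cheaper than a packed fixed-length array earn credit (in
 * bytes), and later nodes may spend that credit when direct addressing would
 * make them larger.
 */
public class DirectAddressingCredit {

  /** With a factor of 1, no more credit can be spent than was earned, so the
   *  FST does not grow overall; larger values allow some net oversizing. */
  private final float oversizingFactor;

  /** Running credit, in bytes. */
  private long creditBytes = 0;

  public DirectAddressingCredit(float oversizingFactor) {
    this.oversizingFactor = oversizingFactor;
  }

  /**
   * Decides whether to encode a node with direct addressing.
   *
   * @param directAddressingBytes node size with one slot per label in the
   *                              [firstLabel..lastLabel] range
   * @param binarySearchBytes     node size as a packed fixed-length array
   *                              searched by binary search
   */
  public boolean shouldUseDirectAddressing(long directAddressingBytes,
                                           long binarySearchBytes) {
    long expansion = directAddressingBytes - binarySearchBytes;
    long spendable = (long) (creditBytes * oversizingFactor);
    if (expansion <= spendable) {
      // Spend credit, or earn some if direct addressing is actually smaller
      // (expansion is negative in that case).
      creditBytes -= expansion;
      return true;
    }
    // Not enough credit: fall back to the packed array encoding.
    return false;
  }
}
{code}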

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, i.e. lookup label -> id and then 
> id -> arc offset instead of doing label -> arc directly, could save a lot of 
> space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
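
To make the quoted point about gap cost concrete, here is a small 
back-of-the-envelope example (the numbers are hypothetical and this is not 
Lucene code) showing how the array-with-gaps encoding can reach the ~4x 
worst case when arc metadata is large.

{code:java}
/**
 * Back-of-the-envelope illustration of why gaps are costly when arc metadata
 * is large: a node with few arcs spread over a wide label range pays for a
 * full fixed-length slot per label in the range, present or not.
 */
public class GapCostExample {
  public static void main(String[] args) {
    int numArcs = 10;      // arcs actually present on the node
    int labelRange = 40;   // lastLabel - firstLabel + 1
    int bytesPerSlot = 16; // hypothetical fixed-length arc metadata size

    // Array-with-gaps (direct addressing): one slot per label in the range.
    int withGaps = labelRange * bytesPerSlot;     // 640 bytes

    // Packed array with binary search: one slot per existing arc only.
    int packed = numArcs * bytesPerSlot;          // 160 bytes

    System.out.println("array-with-gaps: " + withGaps + " bytes");
    System.out.println("packed array:    " + packed + " bytes");
    // 640 / 160 = 4, i.e. the ~4x worst-case RAM usage mentioned above.
  }
}
{code}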


