[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888416#comment-16888416
 ] 

Mike Sokolov commented on LUCENE-8920:
--------------------------------------

Before digging into FST size reduction in earnest, I'd like to tighten up the
FST.Arc contract. Right now all of its members are public and it has no methods
to speak of, so the abstraction boundary is not well defined, and in fact
consumers modify Arc members in a few places outside of the FST class
itself. This makes it harder to reason about the code and to make provably
valid changes. My plan is to do some nonfunctional commits:

1. Add accessors (mostly getters; a few setters will be needed temporarily) to
Arc, and make all of its members private. We often seem to write accessors
with the same name as the member (rather than following the bean standard), so
I'll go with that.
2. Eliminate the setters; this will require some light refactoring in FSTEnum, 
and a few changes to the memory codec, which keeps a list of Arcs locally and 
updates them for its own purposes.
3. Some refactoring and general cleanup (tightening up access, whitespace
fixes, etc.).

Because that first step is going to touch a lot of files, I'll keep it strictly
about introducing the accessors, so there won't be anything beyond changing
things like `arc.flags` to `arc.flags()` in a lot of places.
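
To give a rough idea of the shape of step 1, here is a minimal sketch. The
class name `ArcSketch` and its three fields are hypothetical stand-ins chosen
for illustration, not the actual FST.Arc, and the chaining setters are just one
possible form for the temporary setters.

```java
// Hypothetical stand-in for FST.Arc; names and fields are illustrative only.
final class ArcSketch {
  // Formerly public members, now private.
  private int label;
  private byte flags;
  private long target;

  // Getters share the member's name instead of the getX() bean convention.
  int label() { return label; }
  byte flags() { return flags; }
  long target() { return target; }

  // Temporary setters for step 1; step 2 would eliminate these.
  ArcSketch label(int label) { this.label = label; return this; }
  ArcSketch flags(byte flags) { this.flags = flags; return this; }
  ArcSketch target(long target) { this.target = target; return this; }
}

class ArcAccessorDemo {
  public static void main(String[] args) {
    ArcSketch arc = new ArcSketch().label('a').flags((byte) 0x3).target(42L);
    // Call sites change from arc.flags to arc.flags(), and nothing else.
    System.out.println(arc.label() + " " + arc.flags() + " " + arc.target());
  }
}
```

With the members private, any remaining outside write has to go through a
setter, which makes the call sites that step 2 needs to refactor easy to find.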

Once these changes are in, the fun can begin again :) I'll add Adrien's 
worst-case test and work on getting the size down for that, pursuing the ideas 
in the description.


> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
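
The intermediate-lookup idea quoted above (label -> id, then id -> arc offset)
could look roughly like the following. This is only a sketch under stated
assumptions: `DenseLabelIndex`, its int arrays, and the -1 "no arc" marker are
all hypothetical and do not reflect how FSTs are actually encoded. The point is
that a gap costs one small table slot instead of a full arc's worth of metadata.

```java
// Hypothetical illustration of label -> dense id -> arc offset lookup.
final class DenseLabelIndex {
  private final int firstLabel;
  private final int[] labelToId;   // one slot per label in range; -1 marks a gap
  private final int[] idToOffset;  // one entry per arc that actually exists

  // sortedLabels must be non-empty and ascending; arcOffsets[i] belongs to sortedLabels[i].
  DenseLabelIndex(int[] sortedLabels, int[] arcOffsets) {
    firstLabel = sortedLabels[0];
    int range = sortedLabels[sortedLabels.length - 1] - firstLabel + 1;
    labelToId = new int[range];
    java.util.Arrays.fill(labelToId, -1);
    for (int id = 0; id < sortedLabels.length; id++) {
      labelToId[sortedLabels[id] - firstLabel] = id;
    }
    idToOffset = arcOffsets.clone();
  }

  // Returns the arc offset for the label, or -1 if the node has no such arc.
  int arcOffset(int label) {
    int slot = label - firstLabel;
    if (slot < 0 || slot >= labelToId.length) {
      return -1;
    }
    int id = labelToId[slot];
    return id < 0 ? -1 : idToOffset[id];
  }
}
```

For example, a node with arcs for 'a', 'c' and 'z' stores 26 table slots but
only three arc entries, and the label itself never needs to be repeated in the
arc metadata because it is implied by the slot.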



