[
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924203#comment-16924203
]
Mike Sokolov commented on LUCENE-8920:
--------------------------------------
If I understand you correctly, T1 is the threshold we introduced earlier this
year (or rather its inverse, DIRECT_ARC_LOAD_FACTOR in fst.Builder). It's
currently set to 4 (that is, 1/4 as T1 in your formulation). There was
pre-existing logic to decide between the (var-encoded) list and the
(fixed-size, packed) array encoding; my change was piggy-backed on that. That
pre-existing logic is a threshold on N that depends on the depth in the FST;
see FST.shouldExpand.
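
To make the load-factor threshold concrete, here is a minimal sketch of how
such a check could look. It is not the Lucene source: the class and method
names are hypothetical, the exact comparison (<= rather than <) is an
assumption, and the depth-dependent list-vs-array decision in
FST.shouldExpand is not modeled at all.

```java
// Hypothetical sketch, not the Lucene implementation.
final class DirectAddressingSketch {
    // Plays the role of DIRECT_ARC_LOAD_FACTOR: at most 4 slots per real arc.
    static final int DIRECT_ARC_LOAD_FACTOR = 4;

    /**
     * Returns true if a node with numArcs arcs, whose labels span
     * [minLabel, maxLabel], is dense enough for direct addressing.
     * The comparison operator is an assumption made for illustration.
     */
    static boolean useDirectAddressing(int numArcs, int minLabel, int maxLabel) {
        long labelRange = (long) maxLabel - minLabel + 1;
        return labelRange <= (long) DIRECT_ARC_LOAD_FACTOR * numArcs;
    }

    /** Fraction of slots in the direct-addressed array that would be empty gaps. */
    static double wastedSlotFraction(int numArcs, int minLabel, int maxLabel) {
        long labelRange = (long) maxLabel - minLabel + 1;
        return (labelRange - numArcs) / (double) labelRange;
    }
}
```

With a factor of 4, a node with 7 arcs spread over 'a'..'z' (26 slots) would
qualify, while a node with 3 arcs over the same range would not. Right at the
threshold, three of every four slots are gaps, which is where the ~4x worst
case comes from.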
If you want to write up the open addressing idea in more detail, it's fine to
add comments here. If you think they would be too long or inconvenient to
write in this form, maybe attach a doc instead? I think that goes directly to
the point of reducing space consumption, so this issue seems like a fine place
for it.
> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Blocker
> Fix For: 8.3
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization.
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte
> range), which makes gaps very costly. Associating each label with a dense id
> and having an intermediate lookup, i.e. lookup label -> id and then id -> arc
> offset instead of doing label -> arc directly, could save a lot of space in
> some cases?
> Also it seems that we are repeating the label in the arc metadata when
> array-with-gaps is used, even though it shouldn't be necessary since the
> label is implicit from the address?
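
The space argument behind the label -> id -> arc idea quoted above can be
sketched with a rough cost model. This is only an illustration under stated
assumptions: every arc is taken to have the same fixed size, the id table is
taken to use a fixed number of bytes per label slot, and all names below are
hypothetical rather than anything in Lucene.

```java
// Hypothetical cost model, not the Lucene implementation.
final class ArcLayoutCostSketch {
    /** Array-with-gaps: one full-size arc slot per label in the range, used or not. */
    static long arrayWithGapsBytes(int labelRange, int bytesPerArc) {
        return (long) labelRange * bytesPerArc;
    }

    /**
     * Indirect layout: a small label -> id table covering the whole range,
     * plus one full-size arc slot per arc that actually exists.
     */
    static long indirectBytes(int numArcs, int labelRange, int bytesPerArc, int bytesPerId) {
        return (long) labelRange * bytesPerId + (long) numArcs * bytesPerArc;
    }
}
```

For example, with 10 arcs spread over a range of 40 labels and 15 bytes per
arc, the array-with-gaps model costs 40 * 15 = 600 bytes, while the indirect
model with 1-byte ids costs 40 * 1 + 10 * 15 = 190 bytes. A gap then costs one
id slot instead of one full arc slot.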