[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923428#comment-16923428 ]
Bruno Roustant commented on LUCENE-8920: ---------------------------------------- [~sokolov] There may be another option to speed-up FST arc lookup while limiting the memory increase. Direct-Addressing option looks up by accessing directly 1 label, and costs up to (num labels x 4 x num bytes to encode) bytes. Label-List option is the opposite, look up needs on average N/2 label comparisons, and costs (num labels x var bytes to encode) bytes. Another option is to use open-addressing. Look up would be <= L comparisons where we can fix L < log(N)/2 (to be faster than binary search), and would cost < (num labels x 2 x num bytes to encode). The idea is to have an array of size 2^p such as 2^(p-1) < N < 2^p. We hash the labels and store them in the array using the open-addressing idea: if a slot is occupied, try with the next block. If we can’t store a label in less than L tries, then abort and fallback to Label-List or Binary-Search option. At lookup we hash the input label and know that we have less than L tries to compare. This is another compromise speed/memory: faster than binary search (constant L), with at least 2x less memory than Direct-Addressing. It is also possible to combine open-addressing and variable length encoding, by finding the first byte starting a label based on the bit used to encode the var length additional bytes. > Reduce size of FSTs due to use of direct-addressing encoding > ------------------------------------------------------------- > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Mike Sokolov > Priority: Blocker > Fix For: 8.3 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve thesituation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which make gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian Jira (v8.3.2#803003) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org