[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

Bruno Roustant (Jira) Thu, 05 Sep 2019 13:47:34 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923428#comment-16923428
 ]


Bruno Roustant edited comment on LUCENE-8920 at 9/5/19 8:46 PM:
----------------------------------------------------------------

[~sokolov]  There may be another option to speed-up FST arc lookup while 
limiting the memory increase.

Direct-Addressing option looks up by accessing directly 1 label, and costs up 
to (num labels x 4 x num bytes to encode) bytes.

Label-List option is the opposite, look up needs on average N/2 label 
comparisons, and costs (num labels x var bytes to encode) bytes.

 

Another option is to use open-addressing. Look up would be <= L comparisons 
where we can fix L < log(N)/2 (to be faster than binary search), and would cost 
< (num labels x 2 x num bytes to encode).

The idea is to have an array of size 2^p such as 2^(p-1) < N < 2^p. We hash the 
labels and store them in the array using the open-addressing idea: if a slot is 
occupied, try with the next block. If we can’t store a label in less than L 
tries, then abort and fallback to Label-List or Binary-Search option. At lookup 
we hash the input label and know that we have less than L tries to compare.

This is another compromise speed/memory: faster than binary search (constant 
L), with at least 2x less memory than Direct-Addressing.

On the Binary-Search side, it could be possible to support variable length 
encoding, by finding the first byte starting a label based on the bit used to 
encode the var length additional bytes.


was (Author: bruno.roustant):
[~sokolov]  There may be another option to speed-up FST arc lookup while 
limiting the memory increase.

Direct-Addressing option looks up by accessing directly 1 label, and costs up 
to (num labels x 4 x num bytes to encode) bytes.

Label-List option is the opposite, look up needs on average N/2 label 
comparisons, and costs (num labels x var bytes to encode) bytes.

 

Another option is to use open-addressing. Look up would be <= L comparisons 
where we can fix L < log(N)/2 (to be faster than binary search), and would cost 
< (num labels x 2 x num bytes to encode).

The idea is to have an array of size 2^p such as 2^(p-1) < N < 2^p. We hash the 
labels and store them in the array using the open-addressing idea: if a slot is 
occupied, try with the next block. If we can’t store a label in less than L 
tries, then abort and fallback to Label-List or Binary-Search option. At lookup 
we hash the input label and know that we have less than L tries to compare.

This is another compromise speed/memory: faster than binary search (constant 
L), with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable length encoding, by 
finding the first byte starting a label based on the bit used to encode the var 
length additional bytes.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

Reply via email to