[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951033#comment-16951033 ]

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

Update about my try with open-addressing.

In fact, open-addressing is not compatible with the FST contract. The FST not 
only needs a label-to-value mapping, it also requires that label-value entries 
stay ordered (e.g. for FSTEnum to get the next arc). Open-addressing does not 
preserve ordering. Dead end; I missed that point.

So I'd like to continue helping on this by leveraging [~jpountz]'s test and 
trying to find a heuristic that takes advantage of direct-addressing while 
limiting the memory overhead.

Medium term, I'd like to explore another option: allowing binary search on a 
variable-length list. If we can do this efficiently, maybe we could reduce 
memory and still have good performance, potentially buying memory to use 
direct-addressing more often.
The idea is to encode the VInt/VLong differently, specifically for FST 
key-value entries. Currently VInt encoding uses one bit per byte to tell 
whether an additional byte follows. We could instead use this bit to flag the 
label bytes (let's say this bit is 0) and the value bytes (this bit is 1). 
Let's call it the keyflag bit. The new VInt-for-FST encoding is then a series 
of bytes sharing the same keyflag bit until the bit changes (and the last byte 
read is ignored). This requires no extra memory, and it allows binary search: 
jump to the middle of the node's whole byte range (keys and values 
indistinctly), read the keyflag there, and move backward/forward until the 
first byte of a key is found, then decode the VInt-for-FST. And so on like a 
regular binary search.
There is a small byte-reading overhead compared to fixed-length binary search, 
but the encoding is more compact, so there may be more IO cache hits on 
average. Performance to be evaluated.
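To make the idea concrete, here is a minimal sketch of the keyflag-bit scheme, 
assuming little-endian 7-bit chunks and alternating key/value runs; the class 
and method names are hypothetical and this is not actual Lucene code:

```java
/**
 * Sketch of the proposed keyflag-bit VInt encoding (hypothetical, not Lucene
 * code). Key bytes carry a high bit of 0, value bytes a high bit of 1; each
 * number is the 7-bit payloads of a run of same-flag bytes, so run boundaries
 * are recoverable from any byte position -- which is what enables binary
 * search over variable-length entries.
 */
public class KeyflagVInt {

  static final int KEY = 0x00, VALUE = 0x80;

  /** Append one number as 7-bit chunks, all tagged with the same flag bit. */
  static int encode(byte[] out, int pos, long v, int flag) {
    do {
      out[pos++] = (byte) ((v & 0x7F) | flag);
      v >>>= 7;
    } while (v != 0);
    return pos;
  }

  /** Decode the run of same-flag bytes starting at start. */
  static long decode(byte[] buf, int start, int end) {
    int flag = buf[start] & 0x80;
    long v = 0;
    int shift = 0;
    for (int i = start; i < end && (buf[i] & 0x80) == flag; i++) {
      v |= (long) (buf[i] & 0x7F) << shift;
      shift += 7;
    }
    return v;
  }

  /** Binary search for key among alternating key/value runs in buf[lo..hi). */
  static boolean contains(byte[] buf, int lo, int hi, long key) {
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      int p = mid;
      // Move back off any value bytes, then to the start of the key run.
      while (p > lo && (buf[p] & 0x80) == VALUE) p--;
      while (p > lo && (buf[p - 1] & 0x80) == KEY) p--;
      long k = decode(buf, p, hi);
      if (k == key) return true;
      if (k < key) {
        // Skip past this key run and its value run.
        int q = p;
        while (q < hi && (buf[q] & 0x80) == KEY) q++;
        while (q < hi && (buf[q] & 0x80) == VALUE) q++;
        lo = q;
      } else {
        hi = p;
      }
    }
    return false;
  }
}
```

The search terminates because each step either moves lo past a full key/value 
pair or moves hi back to a key-run start, and decoding never needs a stored 
length: the flag flip marks the end of every number.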

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
