[ 
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447625#comment-17447625
 ] 

Michael Sokolov commented on LUCENE-10247:
------------------------------------------

As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10247
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10247
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Hendrik Muhs
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is 
> using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next` 
> flag, it writes a VLong which contains the address of the child. This is an 
> absolute address. A relative address can be shorter, relative address means 
> taking the address of the node and expressing the result of child node as 
> diff between the parent and the child. E.g. if the child is _10099_ and the 
> parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative 
> pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not 
> always true, the absolute pointer could be smaller. Therefore the idea is to 
> switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is 
> constructed from the leafs to the root, child node addresses are always 
> smaller than parent ones and due to good locality, connected nodes are 
> usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as 
> link here. It introduces a new bit in _flags_ and changes the compiler to use 
> relative coding if possible. The lookup side needs the current node address 
> and checks one more bit flag. Both additions seem reasonable cheap and do not 
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key 
> dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you 
> have better reference datasets, let me know. It obviously works better the 
> more data.
> The POC implementation only supports 1 output type (positive int output) and 
> I just made it compile. For this 1 type of dictionary it's possible to create 
> and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement 
> on whether this is something you like to have. It seems to me that a size 
> reduction of FST's is always welcome and as pointed out, the runtime 
> additions are reasonable small. If you wonder about the implementation, I 
> happily discuss the nitty gritty details on gh, it had some interesting 
> challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to