[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552624#comment-13552624 ]
Dawid Weiss commented on LUCENE-4682: ------------------------------------- bq. I also "tested" delta-coding the arc target instead of the abs vInt we have today ... I did such experiments when I was working on that paper. Remember -- you don't publish negative results, unfortunately. Obviously I didn't try everything but: 1) NEXT was important, 2) the structure of the FST doesn't yield to easy local deltas; it's not easily separable and pointers typically jump all over. bq. Which is surprising ... I guess we don't see much "locality" for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Not really that surprising -- you encode common suffixes after all so most of them will appear in a properly sized sample. This is actually why the strategy of moving nodes around works too -- you move those that are super frequent but they'll most likely be reordered at the "top" suffix frequencies of the automaton anyway. > Reduce wasted bytes in FST due to array arcs > -------------------------------------------- > > Key: LUCENE-4682 > URL: https://issues.apache.org/jira/browse/LUCENE-4682 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs > Reporter: Michael McCandless > Priority: Minor > Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch > > > When a node is close to the root, or it has many outgoing arcs, the FST > writes the arcs as an array (each arc gets N bytes), so we can e.g. bin > search on lookup. > The problem is N is set to the max(numBytesPerArc), so if you have an outlier > arc e.g. with a big output, you can waste many bytes for all the other arcs > that didn't need so many bytes. > I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size > 1535612 = ~18% wasted. > It would be nice to reduce this. > One thing we could do without packing is: in addNode, if we detect that > number of wasted bytes is above some threshold, then don't do the expansion. > Another thing, if we are packing: we could record stats in the first pass > about which nodes wasted the most, and then in the second pass (paack) we > could set the threshold based on the top X% nodes that waste ... > Another idea is maybe to deref large outputs, so that the numBytesPerArc is > more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org