[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551938#comment-13551938 ]
Robert Muir commented on LUCENE-4682:
-------------------------------------

As an experiment I turned off array arcs for kuromoji in my trunk checkout:

FST
before: [java] 53645 nodes, 253185 arcs, 1535612 bytes... done
after:  [java] 53645 nodes, 253185 arcs, 1228816 bytes... done

JAR
before: -rw-rw-r-- 1 rmuir rmuir 4581420 Jan 12 09:56 lucene-analyzers-kuromoji-5.0-SNAPSHOT.jar
after:  -rw-rw-r-- 1 rmuir rmuir 4306792 Jan 12 09:56 lucene-analyzers-kuromoji-5.0-SNAPSHOT.jar

> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST
> writes the arcs as an array (each arc gets N bytes), so we can e.g. bin
> search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier
> arc e.g. with a big output, you can waste many bytes for all the other arcs
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that the
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass
> about which nodes wasted the most, and then in the second pass (pack) we
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is
> more uniform ...

--
This message is automatically generated by JIRA.
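The padding cost described in the quoted issue, and the "skip the expansion above some threshold" idea, can be illustrated with a small standalone model. This is only a sketch: ArrayArcWaste, its methods, and the maxWasteRatio parameter are hypothetical names made up for illustration and are not part of Lucene's FST code. It also assumes each arc's encoded size is already known and ignores any per-node header bytes.

```java
// Hypothetical model of fixed-width array arcs; NOT Lucene's FST API.
final class ArrayArcWaste {

    /** Bytes used when every arc is padded to the widest arc of the node. */
    static long arrayBytes(int[] bytesPerArc) {
        int max = 0;
        for (int n : bytesPerArc) {
            max = Math.max(max, n);
        }
        return (long) max * bytesPerArc.length;
    }

    /** Bytes used when arcs are written back to back at their own size. */
    static long packedBytes(int[] bytesPerArc) {
        long sum = 0;
        for (int n : bytesPerArc) {
            sum += n;
        }
        return sum;
    }

    /** Padding bytes that exist only because of the fixed-width layout. */
    static long wastedBytes(int[] bytesPerArc) {
        return arrayBytes(bytesPerArc) - packedBytes(bytesPerArc);
    }

    /**
     * Assumed policy: expand a node to an array only if the wasted fraction
     * of the array stays at or below maxWasteRatio (a made-up parameter).
     */
    static boolean shouldExpand(int[] bytesPerArc, double maxWasteRatio) {
        long array = arrayBytes(bytesPerArc);
        if (array == 0) {
            return false; // no arcs, nothing to expand
        }
        return (double) wastedBytes(bytesPerArc) / array <= maxWasteRatio;
    }

    public static void main(String[] args) {
        int[] uniform = {3, 3, 3, 3};   // all arcs the same size: no padding
        int[] outlier = {3, 3, 3, 20};  // one arc with a big output

        System.out.println("uniform wasted: " + wastedBytes(uniform));
        System.out.println("outlier wasted: " + wastedBytes(outlier));
        System.out.println("expand outlier at 25%? " + shouldExpand(outlier, 0.25));
    }
}
```

In this model a single 20-byte arc among three 3-byte arcs forces all four slots to 20 bytes, so 51 of the 80 array bytes are padding, and a 25% threshold would leave that node unexpanded. Whether a similar cutoff is worth the loss of binary search on such nodes is the trade-off the issue raises.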