[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552071#comment-13552071 ]
Robert Muir commented on LUCENE-4682:
-------------------------------------

OK, I committed the vInt for maxBytesPerArc, but left out the heuristic (so we still have the waste!!!)

Here's the comment I added:

{code}
// TODO: try to avoid wasteful cases: disable doFixedArray in that case
/* 
 * 
 * LUCENE-4682: what is a fair heuristic here?
 * It could involve some of these:
 * 1. how "busy" the node is: nodeIn.inputCount relative to frontier[0].inputCount?
 * 2. how much binSearch saves over scan: nodeIn.numArcs
 * 3. waste: numBytes vs numBytesExpanded
 * 
 * the one below just looks at #3
if (doFixedArray) {
  // rough heuristic: make this 1.25 "waste factor" a parameter to the phd ctor????
  int numBytes = lastArcStart - startAddress;
  int numBytesExpanded = maxBytesPerArc * nodeIn.numArcs;
  if (numBytesExpanded > numBytes*1.25) {
    doFixedArray = false;
  }
}
*/
{code}

I think it would just be best to do some performance benchmarks and figure this out.
I know all the kuromoji waste is at node.depth=1 exactly.

Also I indexed all of geonames with this heuristic and it barely changed the FST size:

trunk:
FST: 45296685
packedFST: 39083451

vint maxBytesPerArc:
FST: 45052386
packedFST: 39083451

vint maxBytesPerArc+heuristic:
FST: 44988400
packedFST: 39029108

So the waste and the heuristic don't affect all FSTs, only certain ones.

> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST
> writes the arcs as an array (each arc gets N bytes), so we can e.g. bin
> search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier
> arc e.g.
> with a big output, you can waste many bytes for all the other arcs
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass
> about which nodes wasted the most, and then in the second pass (pack) we
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is
> more uniform ...
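
For anyone who wants to play with the "waste factor" idea (#3 in the comment above) outside the FST code, here is a minimal standalone sketch. It is not Lucene code: the class and method names (ArcWasteHeuristic, packedBytes, expandedBytes, useFixedArray) are made up for illustration, and it assumes numBytes can be approximated by the sum of the per-arc byte counts rather than lastArcStart - startAddress.

```java
// Standalone sketch of the "waste factor" check; NOT part of Lucene.
// Assumption: the variable-length size of a node is the sum of its arc sizes.
final class ArcWasteHeuristic {

    /** Bytes used when arcs are written back to back at their natural sizes. */
    static long packedBytes(int[] arcSizes) {
        long total = 0;
        for (int size : arcSizes) {
            total += size;
        }
        return total;
    }

    /** Bytes used by the array layout: every arc is padded to the widest arc. */
    static long expandedBytes(int[] arcSizes) {
        int maxBytesPerArc = 0;
        for (int size : arcSizes) {
            maxBytesPerArc = Math.max(maxBytesPerArc, size);
        }
        return (long) maxBytesPerArc * arcSizes.length;
    }

    /**
     * Mirrors the commented-out check: give up on the fixed array when the
     * expanded size exceeds the packed size times wasteFactor (1.25 there).
     */
    static boolean useFixedArray(int[] arcSizes, double wasteFactor) {
        return expandedBytes(arcSizes) <= packedBytes(arcSizes) * wasteFactor;
    }

    public static void main(String[] args) {
        int[] outlierNode = {2, 2, 2, 2, 10}; // one big output: 50 expanded vs 18 packed
        int[] uniformNode = {3, 3, 3, 4};     // nearly uniform: 16 expanded vs 13 packed

        System.out.println(useFixedArray(outlierNode, 1.25)); // false
        System.out.println(useFixedArray(uniformNode, 1.25)); // true
    }
}
```

The sketch only covers the byte-waste side; it says nothing about how much binary search saves on lookup, which is why benchmarks would still be needed to pick the factor.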