[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

Michael McCandless (JIRA) Mon, 14 Jan 2013 05:36:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552653#comment-13552653
 ]


Michael McCandless commented on LUCENE-4682:
--------------------------------------------

bq. I mean in general NEXT/reversing adds more complexity here which makes it 
harder to try different things? Like a big doberman and BEWARE sign scaring off 
developers 

LOL :)

But yeah I agree.

bq. Its a big part of what sent me over the edge trying to refactor FST to have 
a store abstraction (LUCENE-4593). But fortunately you did that anyway...

Right but it's not good if bus factor is 1 ... it's effectively dead code when 
that happens.

bq. It would be really really really good for FSTs long term to do things like 
remove reversing, remove packed (fold these optos or at least most of them in 
by default), etc.

+1, except that NEXT buys us a much smaller FST now.  We can't just drop it ... 
we need some way to simplify it (eg somehow stop reversing).
                
> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST 
> writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
> search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
> arc e.g. with a big output, you can waste many bytes for all the other arcs 
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that 
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass 
> about which nodes wasted the most, and then in the second pass (paack) we 
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
> more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

Reply via email to