[ 
https://issues.apache.org/jira/browse/LUCENE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3725:
---------------------------------------

    Attachment: LUCENE-3725.patch

Initial patch. It has tons of nocommits, but I think it's basically
working correctly.

The packing is fairly simplistic now, but we can improve it with time
(I know Dawid has done all sorts of cool things!): it chooses the top
N nodes (sorted by incoming arc count) and saves them dereferenced so
that nodes w/ high in-count get a "low" address.  It also saves the
pointer as delta vs current position, if that would take fewer bytes.
The bytes are then in "forward" order.
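
As a rough illustration of the two ideas above (ranking nodes by incoming
arc count, and writing a pointer as a delta only when that is shorter),
here is a minimal Java sketch. It is not taken from the patch: the class
and method names are hypothetical, and it assumes addresses are written
as 7-bits-per-byte variable-length ints.

```java
import java.util.Arrays;

// Hypothetical sketch, not code from LUCENE-3725.patch.
final class PackingSketch {

  /** Bytes needed for a non-negative value as a 7-bits-per-byte variable-length int (assumed encoding). */
  static int vIntBytes(int value) {
    int bytes = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      bytes++;
    }
    return bytes;
  }

  /** Ids of the topN nodes with the highest incoming arc count, highest first. */
  static int[] topNodesByInCount(int[] inCounts, int topN) {
    Integer[] ids = new Integer[inCounts.length];
    for (int i = 0; i < ids.length; i++) {
      ids[i] = i;
    }
    // Descending by in-count; the sort is stable, so ties keep node-id order.
    Arrays.sort(ids, (a, b) -> Integer.compare(inCounts[b], inCounts[a]));
    int n = Math.min(topN, ids.length);
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = ids[i];
    }
    return top;
  }

  /** True if the distance from the current position encodes in fewer bytes than the absolute address. */
  static boolean deltaIsSmaller(int currentAddress, int targetAddress) {
    int delta = Math.abs(targetAddress - currentAddress);
    return vIntBytes(delta) < vIntBytes(targetAddress);
  }
}
```

With this sketch, a target at address 100050 referenced from position
100000 would be written as a delta, while a target at address 10
referenced from the same position would keep its absolute address.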

The size savings vary by FST, e.g., for the all-Wikipedia-terms FSA
(no outputs) it reduces byte size by 21%.  If I map to ords (FST) then
it's only 13% (I don't do anything to pack the outputs now, so the
bytes required for them are unchanged).

While the resulting FST is smaller, there is some hit to lookup speed
(~8% for the Wikipedia ord FST), because we have to deref some nodes.

I only turned packing on for one thing: the Kuromoji FST (shrank by
14%, 272 KB).

                
> Add optional packing to FST building
> ------------------------------------
>
>                 Key: LUCENE-3725
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3725
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3725.patch
>
>
> The FSTs produced by Builder can be further shrunk if you are willing
> to spend highish transient RAM to do so... our Builder today tries
> hard not to use much RAM (and has options to tweak down the RAM usage,
> in exchange for a somewhat larger FST), even when building immense FSTs.
> But for apps that can afford highish transient RAM to get a smaller
> net FST, I think we should offer packing.
