[
https://issues.apache.org/jira/browse/LUCENE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187522#comment-13187522
]
Dawid Weiss commented on LUCENE-3699:
-------------------------------------
If it's something that is statically compiled (in batch mode), then one could
reorder states (nodes) to minimize the vlength (variable-length encoded size)
of arc pointers globally. This is something I did for fst5 automata and it
worked very well: the distribution of in-node degrees is exponential-like, so
moving a few nodes with many in-links decreases the global automaton size
significantly.

I don't think there is any fast algorithm to do this. I used a simple
heuristic: calculate the in-link degree for each state, sort in descending
order, then reorder the N top-most nodes so that they're at the front of the
serialized automaton. Pick N using any heuristic you like (a constant, an
in-link cutoff; I used a sort of simulated annealing approach and probed
around).
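
A minimal sketch of the heuristic described above, under simplifying assumptions that are not from the fst5 code: a state's "address" is just its position in the serialized order, pointers are encoded as 7-bits-per-byte variable-length integers, and N is picked by a plain scan over a few candidates rather than by simulated annealing. All function names are hypothetical.

```python
from collections import Counter


def vint_len(value):
    """Bytes needed to store `value` as a 7-bits-per-byte variable-length int."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def pointer_bytes(order, arcs):
    """Total bytes spent on arc pointers when states are serialized in `order`.

    Assumption: the address of a state is its index in `order`.
    """
    position = {state: i for i, state in enumerate(order)}
    return sum(vint_len(position[t]) for targets in arcs.values() for t in targets)


def reorder_by_in_degree(order, arcs, n):
    """Move the `n` states with the highest in-link degree to the front."""
    in_degree = Counter(t for targets in arcs.values() for t in targets)
    # sorted() is stable, so ties keep their original relative order.
    top = sorted(order, key=lambda s: -in_degree[s])[:n]
    top_set = set(top)
    return top + [s for s in order if s not in top_set]


def best_n(order, arcs, candidates):
    """Probe a few values of N and keep the one giving the smallest pointer size."""
    return min(
        candidates,
        key=lambda n: pointer_bytes(reorder_by_in_degree(order, arcs, n), arcs),
    )


# Toy automaton: 299 states all point at state 299, which sits at the end.
arcs = {s: [299] for s in range(299)}
arcs[299] = []
order = list(range(300))

before = pointer_bytes(order, arcs)                               # 598
after = pointer_bytes(reorder_by_in_degree(order, arcs, 1), arcs)  # 299
```

In this toy case, moving a single heavily referenced state to the front halves the pointer bytes, because its address drops below the one-byte threshold. In a real serializer the addresses are byte offsets that shift as pointer sizes change, so the effect has to be measured on the actual output.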
The presentation about the paper in question is here:
http://ciaa-fsmnlp-2011.univ-tours.fr/ciaa/upload/files/Weiss-Daciuk.pdf
I can't publish the PDF of the paper publicly (it's with Springer, link below), but I can send
a PDF copy if somebody is interested. The concept should be clear without the
paper anyway :)
http://www.springerlink.com/content/60r47952k610l822/
> kuromoji dictionary could be more compact
> -----------------------------------------
>
> Key: LUCENE-3699
> URL: https://issues.apache.org/jira/browse/LUCENE-3699
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Robert Muir
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3699.patch, LUCENE-3699_more.patch
>
>
> Reading through the ipadic documentation, I realized we are storing a lot of
> redundant information.
> For example, the connection costs for bigram weights are based on
> POS+inflection data, so it's redundant
> to also separately encode POS and inflection data for each entry.
> With the patch the dictionary access is also faster and simpler, and
> TokenInfoDictionary is 1.5MB smaller.