[
https://issues.apache.org/jira/browse/LUCENE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187532#comment-13187532
]
Robert Muir commented on LUCENE-3699:
-------------------------------------
Dawid, currently the FST is not really the biggest culprit:
{noformat}
-rw-r--r-- 1 rmuir staff 65568 Jan 16 16:35 CharacterDefinition.dat
-rw-r--r-- 1 rmuir staff 2624540 Jan 16 16:35 ConnectionCosts.dat
-rw-r--r-- 1 rmuir staff 4337216 Jan 17 03:22 TokenInfoDictionary$buffer.dat
-rw-r--r-- 1 rmuir staff 1954846 Jan 16 16:35 TokenInfoDictionary$fst.dat
-rw-r--r-- 1 rmuir staff 54870 Jan 16 16:35 TokenInfoDictionary$posDict.dat
-rw-r--r-- 1 rmuir staff 392165 Jan 17 03:22 TokenInfoDictionary$targetMap.dat
-rw-r--r-- 1 rmuir staff 311 Jan 17 03:22 UnknownDictionary$buffer.dat
-rw-r--r-- 1 rmuir staff 4111 Jan 16 16:35 UnknownDictionary$posDict.dat
-rw-r--r-- 1 rmuir staff 69 Jan 16 16:35 UnknownDictionary$targetMap.dat
{noformat}
As for the FST, our output is just an increasing ord (according to term sort
order), so I think it should be pretty good? Is there something more efficient
than this?
Basically there are about 330k headwords and 390k words, so some words have
different parts of speech, readings, etc. for the same surface form.
The $fst.dat is currently FST<int>, where the int is just an ord into
$targetMap.dat, which is really an int[][] (it maps the output ord from the
FST to an int[] containing the offsets of all word entries for that surface
form).
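To make the two-level lookup above concrete, here is a minimal sketch. It is
not the Kuromoji code: the class and method names are hypothetical, a HashMap
stands in for the FST, and the surface forms and offsets are made-up sample
data. It only shows the shape of the mapping (surface form -> ord -> int[] of
entry offsets).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the lookup path; not the actual Kuromoji classes. */
final class SurfaceFormLookup {
    // Stand-in for $fst.dat: surface form -> ord (position in term sort order).
    private final Map<String, Integer> surfaceToOrd = new HashMap<>();
    // Stand-in for $targetMap.dat: ord -> offsets of every entry for that form.
    private final int[][] targetMap;

    /** Assumes sortedSurfaceForms is already in term sort order, one row of targetMap per form. */
    SurfaceFormLookup(String[] sortedSurfaceForms, int[][] targetMap) {
        for (int ord = 0; ord < sortedSurfaceForms.length; ord++) {
            surfaceToOrd.put(sortedSurfaceForms[ord], ord);
        }
        this.targetMap = targetMap;
    }

    /** Returns the entry offsets for a surface form, or an empty array if it is unknown. */
    int[] entryOffsets(String surfaceForm) {
        Integer ord = surfaceToOrd.get(surfaceForm);
        return ord == null ? new int[0] : targetMap[ord];
    }

    public static void main(String[] args) {
        // Made-up data: "hashi" has two entries (two readings/POS), "kami" has one.
        SurfaceFormLookup dict = new SurfaceFormLookup(
                new String[] {"hashi", "kami"},
                new int[][] {{0, 11}, {22}});
        System.out.println(Arrays.toString(dict.entryOffsets("hashi")));
        System.out.println(Arrays.toString(dict.entryOffsets("unknown")));
    }
}
```

Each returned offset would then be used to read one entry's metadata (cost,
part of speech, reading, and so on) out of the buffer.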
But the 'meat' describing the entries is in $buffer.dat. For each word this is
its cost, part of speech, base form (stem), reading, pronunciation, etc. As you
can see, we are down to about 11 bytes per lemma on average, but this
'metadata' is still the biggest piece; that's what I was working on shrinking
in this issue.
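As a rough check of those numbers, the sketch below recomputes them from the
file sizes in the listing. The class name is hypothetical, and the entry count
of 390,000 is only the approximate "390k words" figure quoted above, so the
per-entry result is an estimate.

```java
/** Back-of-the-envelope arithmetic on the file sizes listed above. */
final class DictionarySizes {
    static final long BUFFER_BYTES = 4_337_216L; // TokenInfoDictionary$buffer.dat
    static final long FST_BYTES = 1_954_846L;    // TokenInfoDictionary$fst.dat
    // All nine .dat files from the listing, in listing order.
    static final long[] ALL_FILES = {
        65_568L, 2_624_540L, 4_337_216L, 1_954_846L, 54_870L,
        392_165L, 311L, 4_111L, 69L
    };
    static final long APPROX_ENTRIES = 390_000L; // assumption: "about 390k words"

    static long totalBytes() {
        long sum = 0;
        for (long size : ALL_FILES) {
            sum += size;
        }
        return sum;
    }

    static double bytesPerEntry() {
        return (double) BUFFER_BYTES / APPROX_ENTRIES;
    }

    /** Fraction of the total dictionary size taken by one file. */
    static double share(long fileBytes) {
        return (double) fileBytes / totalBytes();
    }

    public static void main(String[] args) {
        System.out.printf("total bytes: %d%n", totalBytes());
        System.out.printf("buffer bytes per entry: %.1f%n", bytesPerEntry());
        System.out.printf("buffer share: %.1f%%%n", 100 * share(BUFFER_BYTES));
        System.out.printf("fst share: %.1f%%%n", 100 * share(FST_BYTES));
    }
}
```

With these inputs the buffer comes out at roughly 11 bytes per entry and a bit
under half of the total, while the FST is about a fifth.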
> kuromoji dictionary could be more compact
> -----------------------------------------
>
> Key: LUCENE-3699
> URL: https://issues.apache.org/jira/browse/LUCENE-3699
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Robert Muir
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3699.patch, LUCENE-3699_more.patch
>
>
> Reading through the ipadic documentation, I realized we are storing a lot of
> redundant information. For example, the connection costs for bigram weights
> are based on POS+inflection data, so it's redundant to also separately encode
> POS and inflection data for each entry.
> With the patch the dictionary access is also faster and simpler, and
> TokenInfoDictionary is 1.5MB smaller.