[ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051597#comment-13051597 ]
Dawid Weiss commented on LUCENE-3206:
-------------------------------------
I encoded the Wikipedia term list as UTF-32 (int4) and as UTF-8 (int1).
Interesting results:
{noformat}
271,461,850 utf32.fst
Arcs: 64,485,082
Nodes: 36,624,613
270,137,939 utf8.fst
Arcs: 66,478,193
Nodes: 38,687,637
{noformat}
So the files are pretty much the same size. The UTF-32 file is slightly bigger
(by about 1.3 MB, roughly 0.5%), but, as predicted, it has fewer arcs and fewer
nodes. I checked, and every input string is the same length or longer in UTF-8
than as a vint-coded UTF-32 sequence. So how come the UTF-32 automaton is
larger? I have no clue; my guess is that it has something to do with the size
of the v-coded pointers. In any case, the size gain from using int1 to encode
UTF-8 is minimal compared to using full Unicode code points with v-coded int4.
Performance-wise it may be a hit (because one would need to convert
UTF-8/UTF-16 to full Unicode code points), but size-wise the two seem to be
about the same.
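The per-string length check mentioned above can be sketched as follows. This
is only an illustration, not the code used to produce the numbers: it assumes
"vint-coded" means the usual variable-length int format with 7 payload bits
per byte, and the class and method names (VIntVsUtf8, vIntLength,
vIntCodePointLength, utf8Length) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class VIntVsUtf8 {
    /** Bytes needed to write a non-negative int as a vInt (assumed: 7 payload bits per byte). */
    static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes if each Unicode code point of the term is written as one vInt. */
    static int vIntCodePointLength(String term) {
        return term.codePoints().map(VIntVsUtf8::vIntLength).sum();
    }

    /** Total bytes of the term encoded as UTF-8. */
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Sample terms: ASCII, Latin with diacritics, CJK, and one supplementary code point.
        String[] terms = {
            "lucene",
            "za\u017C\u00F3\u0142\u0107",
            "\u65E5\u672C\u8A9E",
            "\uD834\uDD1E"
        };
        for (String term : terms) {
            System.out.println(term + ": utf8=" + utf8Length(term)
                + " vint-utf32=" + vIntCodePointLength(term));
        }
    }
}
```

Under that assumption a code point below U+0080 takes one byte either way,
while code points from U+0800 to U+3FFF take three bytes in UTF-8 but two as
a vInt, and supplementary code points take four bytes in UTF-8 but three as a
vInt, so the vint-coded sequence is never the longer one.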
> FST package API refactoring
> ---------------------------
>
> Key: LUCENE-3206
> URL: https://issues.apache.org/jira/browse/LUCENE-3206
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 3.2
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: LUCENE-3206.patch
>
>
> The current API is still marked @experimental, so I think there's still time
> to fiddle with it. I've been using the current API for some time and I do
> have some ideas for improvement. This is a placeholder for these -- I'll post
> a patch once I have a working proof of concept.