[
https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051496#comment-13051496
]
Dawid Weiss commented on LUCENE-3206:
-------------------------------------
I think I know how to compare storing a byte[] of UTF8 against vint-encoded
UTF32 codepoints -- I'll encode the wikipedia terms list both ways and we will
see what comes out. Theoretically they should be very, very similar (and full
unicode codepoints should generate fewer arcs) because UTF8 uses an encoding
scheme with similar overhead to vint encoding... so if something is a
single-byte sequence in UTF8, it will remain a single-byte vint. A double-byte
UTF8 character will remain a double-byte vint (the last double-byte codepoint
is 0x7ff=2047, whereas the last double-byte vint is 2^14-1=16383). And so on.
So for text, vint-encoded UTF32 should be more compact than UTF8... The gain
from byte[] comes, of course, when your "labels" are not text but arbitrary
bytes -- then a byte[] representation would be nicer.
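
The size argument above can be checked per codepoint with a small sketch. It
is only an illustration, not code from the patch: the function names
(utf8_len, vint_len, compare) are hypothetical, and the vint is assumed to
store 7 payload bits per byte with the high bit as a continuation flag.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF8 needs for a single codepoint."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def vint_len(value: int) -> int:
    """Number of bytes for a vint, assuming 7 payload bits per byte."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def compare(text: str) -> tuple[int, int]:
    """Return (UTF8 bytes, vint bytes) needed to store the text's codepoints."""
    cps = [ord(ch) for ch in text]
    return sum(utf8_len(c) for c in cps), sum(vint_len(c) for c in cps)


if __name__ == "__main__":
    # Plain ASCII, accented Latin, Thai (U+0E44 U+0E17 U+0E22).
    for sample in ["lucene", "na\u00efve", "\u0e44\u0e17\u0e22"]:
        print(repr(sample), compare(sample))
```

Under these assumptions the two lengths differ only for codepoints in
0x800..0x3fff and above 0xffff, where the vint is one byte shorter; everywhere
else they are equal. This says nothing about arc counts in the resulting
automaton, which is what the wikipedia terms experiment would measure.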
> FST package API refactoring
> ---------------------------
>
> Key: LUCENE-3206
> URL: https://issues.apache.org/jira/browse/LUCENE-3206
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 3.2
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: LUCENE-3206.patch
>
>
> The current API is still marked @experimental, so I think there's still time
> to fiddle with it. I've been using the current API for some time and I do
> have some ideas for improvement. This is a placeholder for these -- I'll post
> a patch once I have a working proof of concept.