[jira] [Commented] (LUCENE-3206) FST package API refactoring

Dawid Weiss (JIRA) Sat, 18 Jun 2011 03:43:06 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051496#comment-13051496
 ]


Dawid Weiss commented on LUCENE-3206:
-------------------------------------

I think I know how to compare storing a byte[] of UTF8 against 
vint-encoded UTF32 codepoints -- I'll encode the wikipedia terms list 
both ways and we will see what comes out. Theoretically they should be very, 
very similar (and full unicode codepoints should generate fewer arcs) because 
UTF8 uses an encoding scheme with similar overhead to vint encoding... so if 
something is a single-byte sequence in UTF8, it will remain a single-byte vint. 
A double-byte UTF8 character will remain a double-byte vint (the last 
double-byte codepoint is 0x7ff=2047, whereas the last double-byte vint is 
2^14-1=16383). And so on. So for text, vint-encoded UTF32 should never be 
larger than UTF8, and is sometimes smaller... The gain from byte[] is of 
course when your "labels" are not text, but arbitrary bytes -- then the 
byte[] representation would be nicer.
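
The size argument above can be checked with a small sketch. This is not the 
planned wikipedia experiment and not Lucene code: the class and method names 
are hypothetical, and it assumes the usual vint layout of 7 payload bits per 
byte with the high bit as a continuation flag.

```java
/** Sketch only: per-codepoint byte counts for UTF-8 and a 7-bit vint. */
final class VIntVsUtf8 {

    /** Bytes needed to write a non-negative int as a vint (assumed: 7 payload bits per byte). */
    static int vintLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) { // more than 7 significant bits remain
            value >>>= 7;
            length++;
        }
        return length;
    }

    /** Bytes needed to encode one Unicode codepoint in UTF-8. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Total UTF-8 size of a term, summed over its codepoints. */
    static int utf8Bytes(String term) {
        return term.codePoints().map(VIntVsUtf8::utf8Length).sum();
    }

    /** Total size of a term when each codepoint is written as a vint. */
    static int vintBytes(String term) {
        return term.codePoints().map(VIntVsUtf8::vintLength).sum();
    }
}
```

Under these assumptions the two encodings differ only for codepoints in 
0x800..0x3fff (3 bytes in UTF8, 2 as a vint) and above 0xffff (4 bytes in 
UTF8, 3 as a vint); everywhere else they take the same number of bytes. 
Comparing utf8Bytes(term) with vintBytes(term) over a terms list gives the 
raw label sizes only, not the resulting FST size or arc count.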



> FST package API refactoring
> ---------------------------
>
>                 Key: LUCENE-3206
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3206
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 3.2
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.3, 4.0
>
>         Attachments: LUCENE-3206.patch
>
>
> The current API is still marked @experimental, so I think there's still time 
> to fiddle with it. I've been using the current API for some time and I do 
> have some ideas for improvement. This is a placeholder for these -- I'll post 
> a patch once I have a working proof of concept.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
