[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

Michael McCandless (JIRA) Tue, 25 Mar 2008 02:25:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581843#action_12581843
 ]


Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
I think we should not use \uffff as a terminator in the Lucene library,
regardless of the fact that it is allowed in the Unicode standard, because it
is unnecessary.
{quote}

I'm not yet convinced it's unnecessary.  We need to run performance
tests to understand the time/space tradeoff here.  If this change
speeds up indexing we should do it.  RAM is cheap.

By far, the Posting instances consume the most RAM in DocumentsWriter.
Right now each Posting is 66 bytes; this patch, once finished,
increases that to 68 bytes.

I don't like increasing the byte usage of Posting unless there's a
good counterbalance, which I think this change *may* have if we see
that it improves indexing speed.

I just checked: when indexing Wikipedia with a 64 MB buffer, each
segment flushed has ~430,000 Posting instances.  So the Posting
instances alone account for 27 MB of the buffer.

That means the added 2 bytes from this change will consume ~840 KB of
additional RAM per buffer, which is a not insignificant loss of RAM
efficiency.
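
To make the arithmetic above concrete, here is a small Java sketch that
reproduces both estimates from the quoted figures (66 vs. 68 bytes per
Posting, ~430,000 Posting instances per flushed segment).  The class and
method names are made up for illustration and are not part of Lucene; the
per-Posting sizes are taken as given here, not measured.

```java
final class PostingRamEstimate {
    // Figures quoted in this comment; the instance count is approximate.
    static final long POSTINGS_PER_SEGMENT = 430_000L;
    static final int BYTES_PER_POSTING_NOW = 66;
    static final int BYTES_PER_POSTING_PATCHED = 68;

    /** Total bytes consumed by the given number of Posting instances. */
    static long totalBytes(long postings, int bytesPerPosting) {
        return postings * bytesPerPosting;
    }

    public static void main(String[] args) {
        long before = totalBytes(POSTINGS_PER_SEGMENT, BYTES_PER_POSTING_NOW);
        long after = totalBytes(POSTINGS_PER_SEGMENT, BYTES_PER_POSTING_PATCHED);

        // Report in binary units (1 MB = 1024 * 1024 bytes, 1 KB = 1024 bytes).
        System.out.printf("Posting RAM today: %.1f MB%n", before / (1024.0 * 1024.0));
        System.out.printf("Extra RAM with patch: %.0f KB%n", (after - before) / 1024.0);
    }
}
```

This works out to roughly 27 MB for the Posting instances today and about
840 KB of extra RAM with the patch, matching the numbers above.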

[Aside: by Zipf's law, the vast majority of these terms should occur
rarely.  Eg roughly half will occur only once.  If we could find some
way to represent these rare terms with a much more compact structure
(Posting has a lot of "overhead" to efficiently manage a long posting
list) then we would greatly increase DW's RAM efficiency.]




> 0xffff char is not a string terminator
> --------------------------------------
>
>                 Key: LUCENE-1241
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Hiroaki Kawai
>         Attachments: ComparableCharSequence.java, LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but 
> it should not, for a few reasons. \uffff is not itself a terminator char, 
> and we can't handle a string that really contains \uffff. Also, we can 
> calculate the end char position in a character sequence from the string 
> length that we already know.
> However, I agree with its use for assertions, where "\uffff" is placed at 
> the end of a string in a char sequence.
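
The two ways of finding the end of a term described in the issue can be
sketched as follows.  This is only an illustration under simplifying
assumptions: the names (TermEndSketch, store, endBySentinel, endByLength)
are hypothetical, each term gets its own small char array, and none of this
is the actual DocumentsWriter code or the attached patch.

```java
final class TermEndSketch {
    static final char END_MARKER = '\uffff';

    /** Copies the term into a fresh char array and appends the end marker. */
    static char[] store(String term) {
        char[] pool = new char[term.length() + 1];
        term.getChars(0, term.length(), pool, 0);
        pool[term.length()] = END_MARKER;
        return pool;
    }

    /** Sentinel approach: scan forward from start until the marker is seen. */
    static int endBySentinel(char[] pool, int start) {
        int pos = start;
        while (pool[pos] != END_MARKER) {
            pos++;
        }
        return pos;
    }

    /** Length approach: the end follows directly from the known length. */
    static int endByLength(int start, int length) {
        return start + length;
    }

    public static void main(String[] args) {
        String plain = "lucene";
        String tricky = "ab\uffffcd"; // a term that really contains \uffff

        // Both approaches agree for an ordinary term.
        System.out.println(endBySentinel(store(plain), 0) == endByLength(0, plain.length()));

        // For the tricky term the scan stops at the embedded \uffff (index 2),
        // while the stored length still gives the true end (index 5).
        System.out.println(endBySentinel(store(tricky), 0));
        System.out.println(endByLength(0, tricky.length()));
    }
}
```

The length-based variant needs the length to be kept somewhere, which is the
space side of the time/space tradeoff discussed in the comment above.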

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

