[ 
https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581049#action_12581049
 ] 

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
we can't handle a string that really contains \uffff
{quote}
Such a string is not valid UTF-16 for interchange.  The Unicode standard 
explicitly reserves certain code points (including this one) for internal 
use.

{quote}
However, I agree with the usage for assertion, that "\uffff" is placed after at 
the end of a string in a char sequence.
{quote}
I don't think this is necessary for assertion.  The memory cost is sizable.  
Right now marking the end of a string consumes 2 bytes (the 0xffff char) per 
posting.  By adding a length we consume an additional 4 bytes.  While 
indexing, there are a large number of postings (one per unique term), so this 
added RAM usage is not negligible.
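
To make the arithmetic above concrete, here is a minimal sketch.  The class, 
the method names, and the count of one million unique terms are hypothetical 
and chosen only for illustration; they are not taken from DocumentsWriter or 
from the patch.

```java
// Illustrative arithmetic only; names and the example term count are hypothetical.
final class PostingOverhead {
    static final int CHAR_NUM_BYTE = 2; // a Java char occupies 2 bytes
    static final int INT_NUM_BYTE = 4;  // a Java int occupies 4 bytes

    /** Bytes spent per posting on recording where the term text ends. */
    static long bytesPerPosting(boolean terminator, boolean lengthField) {
        long bytes = 0;
        if (terminator) {
            bytes += CHAR_NUM_BYTE; // trailing 0xffff char
        }
        if (lengthField) {
            bytes += INT_NUM_BYTE;  // explicit int length
        }
        return bytes;
    }

    /** Total overhead across all postings (one posting per unique term). */
    static long totalBytes(long uniqueTerms, boolean terminator, boolean lengthField) {
        return uniqueTerms * bytesPerPosting(terminator, lengthField);
    }

    public static void main(String[] args) {
        long uniqueTerms = 1_000_000L; // hypothetical workload
        System.out.println("terminator only: " + totalBytes(uniqueTerms, true, false));
        System.out.println("length only:     " + totalBytes(uniqueTerms, false, true));
        System.out.println("both:            " + totalBytes(uniqueTerms, true, true));
    }
}
```

Keeping both costs 6 bytes per posting, against 2 for the terminator alone or 
4 for the length alone.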

I think we should do one or the other, but not both.

Really the tradeoff we are exploring here is whether using up 2 more bytes per 
term (4 for the int length, minus 2 for the dropped 0xffff char), which causes 
us to flush sooner and merge more often for a given RAM buffer size, is offset 
by the speedup of not having to check for 0xffff and compute length in certain 
places.
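
As a rough sketch of the two approaches being weighed, assuming term text 
lives in a shared char[] with each term followed by 0xffff: the class and 
method names below are hypothetical and this is not the DocumentsWriter code.

```java
// Hypothetical sketch: terminator-based vs. length-based handling of term text.
final class TermTextEnd {
    static final char TERMINATOR = '\uffff';

    /** Terminator approach: the length must be recomputed by scanning for the marker. */
    static int lengthByTerminator(char[] pool, int start) {
        int pos = start;
        while (pool[pos] != TERMINATOR) {
            pos++;
        }
        return pos - start;
    }

    /**
     * Terminator approach: compare a stored term against a token.
     * Assumes the token itself never contains the marker char.
     */
    static boolean equalsByTerminator(char[] pool, int start, char[] token, int tokenLen) {
        for (int i = 0; i < tokenLen; i++) {
            if (pool[start + i] != token[i]) {
                return false; // also hit when the stored term is shorter than the token
            }
        }
        // The stored term must end exactly where the token ends.
        return pool[start + tokenLen] == TERMINATOR;
    }

    /** Length approach: a stored int length lets us reject mismatches without scanning. */
    static boolean equalsByLength(char[] pool, int start, int storedLen,
                                  char[] token, int tokenLen) {
        if (storedLen != tokenLen) {
            return false;
        }
        for (int i = 0; i < tokenLen; i++) {
            if (pool[start + i] != token[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] pool = "abc\uffffde\uffff".toCharArray();
        System.out.println(lengthByTerminator(pool, 0));
        System.out.println(equalsByTerminator(pool, 4, "de".toCharArray(), 2));
        System.out.println(equalsByLength(pool, 0, 3, "abc".toCharArray(), 3));
    }
}
```

The length-based variant avoids the scan and the marker check, but only by 
paying for the extra int on every posting, which is why measurements matter.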

One problem with the patch is that you forgot to add another int (4 bytes) to 
POSTING_NUM_BYTE in DocumentsWriter.  This is important because the tradeoff 
we are exploring here is whether increasing the RAM usage of a Posting, which 
causes more frequent flushing, while saving some of the cost of comparing to 
0xffff in certain places, is net/net a performance "win".  Can you fix this?  
Thanks.

Have you run any performance tests to assess the impact of this change?  I 
think that's critical here since if this is net/net a performance loss we 
shouldn't make the change.

> 0xffff char is not a string terminator
> --------------------------------------
>
>                 Key: LUCENE-1241
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Hiroaki Kawai
>         Attachments: LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but 
> it should not to be for some reasons. \uffff is not a terminator char itself 
> and we can't handle a string that really contains \uffff. And also, we can 
> calculate the end char position in a character sequence from the string 
> length that we already know.
> However, I agree with the usage for assertion, that "\uffff" is placed after 
> at the end of a string in a char sequence.
