[ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581049#action_12581049 ]
Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
we can't handle a string that really contains \uffff
{quote}

A string containing \uffff is not a valid UTF-16 string for interchange. The standard explicitly allows certain characters (including this one) to be used for internal purposes.

{quote}
However, I agree with the usage for assertion, that "\uffff" is placed after at the end of a string in a char sequence.
{quote}

I don't think this is necessary for assertion. The memory cost is sizable: right now, tracking a string's length consumes 2 bytes (the 0xffff char) per posting, and adding a length consumes an additional 4 bytes. While indexing there are a large number of postings (one per unique term), so this added RAM usage is not negligible. I think we should do one or the other, but not both.

Really the tradeoff we are exploring here is whether using up 2 more bytes per term, which causes us to flush sooner and merge more often for a given RAM buffer size, is offset by the speedup of not having to check for 0xffff and compute the length in certain places.

One problem with the patch is that you forgot to add another int (4 bytes) to POSTING_NUM_BYTE in DocumentsWriter. This matters because the tradeoff above depends on it: increasing the RAM usage of a Posting causes more frequent flushing, and the question is whether the savings from not having to compare to 0xffff in certain places make the change a net performance win. Can you fix this? Thanks.

Have you run any performance tests to assess the impact of this change? I think that's critical here, since if this is a net performance loss we shouldn't make the change.
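To make the per-posting arithmetic above concrete, here is a minimal Java sketch of the sentinel scheme and the RAM overheads under discussion. It is not Lucene code: the class TermTextSketch, its methods and constants, and the figure of 1,000,000 unique terms are all hypothetical and chosen only for illustration. It shows the scan for the terminator, how that scan truncates a term that itself contains \uffff, and the added RAM at 2, 4, and 6 bytes per term.

```java
// Sketch only: every name here is hypothetical and none of it is Lucene's code.
final class TermTextSketch {
    static final char END = '\uffff';     // sentinel placed after each term's chars
    static final int SENTINEL_BYTES = 2;  // one Java char per term
    static final int LENGTH_BYTES = 4;    // one Java int per term

    // Sentinel scheme: copy the term into the pool and append the terminator.
    // Returns the offset just past the terminator.
    static int writeWithSentinel(char[] pool, int start, String term) {
        term.getChars(0, term.length(), pool, start);
        pool[start + term.length()] = END;
        return start + term.length() + 1;
    }

    // Sentinel scheme: the length is recovered by scanning for the terminator.
    static int lengthBySentinel(char[] pool, int start) {
        int pos = start;
        while (pool[pos] != END) {
            pos++;
        }
        return pos - start;
    }

    // Added RAM for a number of unique terms at a given per-term overhead.
    static long overheadBytes(long uniqueTerms, int bytesPerTerm) {
        return uniqueTerms * bytesPerTerm;
    }

    public static void main(String[] args) {
        char[] pool = new char[16];
        int next = writeWithSentinel(pool, 0, "lucene");
        System.out.println(lengthBySentinel(pool, 0));    // prints 6

        // A term that really contains \uffff is cut short by the scan.
        writeWithSentinel(pool, next, "a\uffffb");
        System.out.println(lengthBySentinel(pool, next)); // prints 1, not 3

        long terms = 1_000_000L; // hypothetical unique-term count
        System.out.println(overheadBytes(terms, SENTINEL_BYTES));                // prints 2000000
        System.out.println(overheadBytes(terms, LENGTH_BYTES));                  // prints 4000000
        System.out.println(overheadBytes(terms, SENTINEL_BYTES + LENGTH_BYTES)); // prints 6000000
    }
}
```

Keeping both the terminator and an explicit length costs the sum of the two overheads, which is why doing one or the other, but not both, keeps the RAM buffer from filling sooner than it has to.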
> 0xffff char is not a string terminator
> --------------------------------------
>
> Key: LUCENE-1241
> URL: https://issues.apache.org/jira/browse/LUCENE-1241
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Hiroaki Kawai
> Attachments: LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but
> it should not to be for some reasons. \uffff is not a terminator char itself
> and we can't handle a string that really contains \uffff. And also, we can
> calculate the end char position in a character sequence from the string
> length that we already know.
> However, I agree with the usage for assertion, that "\uffff" is placed after
> at the end of a string in a char sequence.