[ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581843#action_12581843 ]
Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
I think we should not use \uffff as a terminator in Lucene library regardless of the fact that it is allowed in Unicode standard, because it is unnecessary.
{quote}

I'm not yet convinced it's unnecessary. We need to run performance tests to understand the time/space tradeoff here. If this change speeds up indexing we should do it. RAM is cheap.

By far, the Posting instances consume the most RAM in DocumentsWriter. Right now each Posting is 66 bytes; this patch, once finished, increases that to 68 bytes. I don't like increasing the byte usage of Posting unless there's a good counterbalance, which I think this change *may* have if we see that it improves indexing speed.

I just checked: when indexing Wikipedia with a 64 MB buffer, each segment flushed has ~430,000 Posting instances. So the Posting instances alone account for 27 MB of the buffer. That means the added 2 bytes from this change will consume ~840 KB of additional RAM, which is a not insignificant loss of RAM efficiency.

[Aside: by Zipf's law, the vast majority of these terms should occur rarely. E.g., roughly half will occur only once. If we could find some way to represent these rare terms with a much more compact structure (Posting has a lot of "overhead" to efficiently manage a long posting list) then we would greatly increase DW's RAM efficiency.]

> 0xffff char is not a string terminator
> --------------------------------------
>
>                 Key: LUCENE-1241
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Hiroaki Kawai
>         Attachments: ComparableCharSequence.java, LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but
> it should not to be for some reasons. \uffff is not a terminator char itself
> and we can't handle a string that really contains \uffff.
> And also, we can
> calculate the end char position in a character sequence from the string
> length that we already know.
> However, I agree with the usage for assertion, that "\uffff" is placed after
> at the end of a string in a char sequence.
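
For anyone who wants to re-check the RAM figures in the comment, here is a minimal sketch of the arithmetic. The class and method names (`PostingRamEstimate`, `totalBytes`) are made up for illustration and are not part of Lucene; the only inputs are the numbers quoted in the comment (about 430,000 Posting instances per flushed segment, 66 bytes per Posting now, 68 bytes with the patch, a 64 MB buffer), and MB/KB are assumed to mean binary units (1 MB = 1024 * 1024 bytes).

```java
import java.util.Locale;

// Hypothetical helper, not Lucene code: reproduces the back-of-envelope
// RAM estimate from the comment using the figures quoted there.
public class PostingRamEstimate {

    /** Total bytes used by postingCount Posting instances of bytesPerPosting bytes each. */
    public static long totalBytes(long postingCount, int bytesPerPosting) {
        return postingCount * bytesPerPosting;
    }

    public static void main(String[] args) {
        long postings = 430_000L;               // approx. Posting instances per flushed segment
        int bytesBefore = 66;                   // current size of one Posting
        int bytesAfter = 68;                    // size of one Posting with the patch applied
        long bufferBytes = 64L * 1024 * 1024;   // 64 MB RAM buffer (binary units assumed)

        long baseline = totalBytes(postings, bytesBefore);
        long extra = totalBytes(postings, bytesAfter) - baseline;

        // Locale.ROOT keeps the decimal separator independent of the JVM's default locale.
        System.out.println(String.format(Locale.ROOT,
                "Posting RAM today: %.1f MB", baseline / (1024.0 * 1024.0)));
        System.out.println(String.format(Locale.ROOT,
                "Extra RAM with patch: %.0f KB", extra / 1024.0));
        System.out.println(String.format(Locale.ROOT,
                "Extra RAM as share of buffer: %.2f%%", 100.0 * extra / bufferBytes));
    }
}
```

With these inputs the baseline works out to 28,380,000 bytes (about 27 MB) and the extra cost to 860,000 bytes (about 840 KB), which matches the figures above and comes to a bit over 1% of the 64 MB buffer.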