On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> It seems to me that in theory, Lucene storage code could use true
> UTF-8 to store terms. Maybe it is just a legacy issue that the
> modified UTF-8 is used?

It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is not the number of Unicode code points the string
contains. It's the number of Java chars necessary to contain that
string. Code points above the BMP require 2 Java chars, since they
must be represented by surrogate pairs. In legal UTF-8, the same code
point must be encoded as a single character (one four-byte sequence),
not as two separately encoded surrogates.
If Plucene counts the number of legal UTF-8 characters and assigns
that number as the VInt at the front of a string, when Java Lucene
decodes the string it will allocate an array of char which is too
small to hold the string.
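
To make the mismatch concrete, here is a minimal Java sketch. It is not
Lucene or Plucene code; the class and method names are hypothetical, and
it assumes a JDK recent enough to provide String.codePointCount and
StandardCharsets. It compares the count Java needs (chars) with the count
a code-point-based writer would produce, for a string containing one code
point above the BMP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class LengthPrefixDemo {

    // Number of Java chars (UTF-16 code units). Per the explanation above,
    // this is the value the VInt prefix is expected to hold.
    static int javaCharCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. what a writer counting
    // legal UTF-8 characters would store instead.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard (legal) UTF-8 encoding.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E, which Java stores as the
        // surrogate pair D834 DD1E.
        String s = "a\uD834\uDD1E";

        System.out.println("Java chars:  " + javaCharCount(s));
        System.out.println("Code points: " + codePointCount(s));
        System.out.println("UTF-8 bytes: " + utf8ByteCount(s));

        // A reader that trusts a code-point count as the char count
        // would size its buffer too small for this string.
        char[] buffer = new char[codePointCount(s)];
        System.out.println("Buffer too small: "
                + (buffer.length < javaCharCount(s)));
    }
}
```

For this string the char count is 3 and the code point count is 2, so a
char array sized from the code point count is one element short.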
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/