On Aug 26, 2005, at 10:14 PM, jian chen wrote:

It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used?

It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 Java chars, since they must be represented by surrogate pairs. In legal UTF-8, the same code point must be represented by a single character (one four-byte sequence).
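To make the three counts concrete, here is a small Java sketch. It is not Lucene code; the class and method names are made up for illustration, and it only uses standard String methods. It measures one string that contains a single code point above the BMP.

```java
import java.nio.charset.StandardCharsets;

public class CharCountDemo {
    // Number of Java chars (UTF-16 code units). Per the message above,
    // this is the quantity the VInt prefix records.
    static int javaCharCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is what a count of
    // legal UTF-8 characters would give.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard (legal) UTF-8 encoding.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E, a code point above the BMP.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println("Java chars:  " + javaCharCount(s));
        System.out.println("Code points: " + codePointCount(s));
        System.out.println("UTF-8 bytes: " + utf8ByteCount(s));
    }
}
```

For this string the Java char count is one higher than the code point count, because the surrogate pair takes two chars but is a single code point.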

If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string.
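A simplified Java illustration of that failure follows. It is a hypothetical stand-in, not the actual Lucene reader: it assumes only that the reader allocates a char array sized by the prefix and then fills it with the decoded text. The names are invented for the example.

```java
public class PrefixMismatchDemo {
    // Hypothetical stand-in for a reader that trusts the length prefix:
    // it allocates exactly declaredCount chars, then copies the text in.
    // Returns false if the text does not fit in the allocated array.
    static boolean fitsInDeclaredBuffer(String decoded, int declaredCount) {
        char[] buffer = new char[declaredCount];
        try {
            decoded.getChars(0, decoded.length(), buffer, 0);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E, a code point above the BMP.
        String s = "a" + new String(Character.toChars(0x1D11E));

        // Prefix written as a count of code points (legal UTF-8 characters).
        int codePoints = s.codePointCount(0, s.length());
        System.out.println(fitsInDeclaredBuffer(s, codePoints));

        // Prefix written as a count of Java chars.
        System.out.println(fitsInDeclaredBuffer(s, s.length()));
    }
}
```

With the code point count as the prefix, the array comes up one char short for every code point above the BMP in the string.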

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

