On Aug 26, 2005, at 10:14 PM, jian chen wrote:

It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?

The use of 0xC0 0x80 to encode the U+0000 Unicode code point is an aspect of Java's serialization of character streams. Java calls this "a modified version of UTF-8", though that's a misleading way to describe it. It's a different Unicode encoding, one that resembles UTF-8, but that's it.
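You can see this directly with DataOutputStream.writeUTF(), which emits the modified encoding. A quick sketch for a lone U+0000 (the first two bytes are writeUTF's length prefix):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF("\u0000");          // serialize a single U+0000
        byte[] bytes = bos.toByteArray();
        // writeUTF prepends a two-byte length (here 0x00 0x02), then the
        // payload: 0xC0 0x80 instead of the single 0x00 byte of real UTF-8.
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b);
        }
        System.out.println();            // prints: 0x00 0x02 0xC0 0x80
    }
}
```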

It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require two Java chars, since they must be represented by surrogate pairs; in legal UTF-8, that same code point counts as a single character.
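A quick illustration of the mismatch, using U+1D11E (a code point above the BMP):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so Java
        // stores it as a surrogate pair: two chars for one code point.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
    }
}
```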

If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, then when Java Lucene decodes the string it will allocate an array of char that is too small to hold the string.

I think Jian was proposing that Lucene switch to using a true UTF-8 encoding, which would make things a bit cleaner. And probably easier than changing all references to CESU-8 :)

And yes, given that the integer count is the number of UTF-16 code units required to represent the string, your code will need to do a bit more processing when calculating the character count, but that's a one-liner, right?
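Something like this, in Java terms (the helper name is mine, just a sketch): the UTF-16 code unit count is one per code point, plus one extra for each code point above the BMP:

```java
public class VIntCountSketch {
    // Hypothetical helper: computes the count the Lucene VInt expects,
    // i.e. UTF-16 code units, from a decoded string. Each code point
    // above the BMP contributes one extra unit (its surrogate pair).
    static long utf16Units(String s) {
        long codePoints = s.codePointCount(0, s.length());
        long supplementary = s.codePoints().filter(cp -> cp > 0xFFFF).count();
        return codePoints + supplementary; // equals s.length() in Java
    }

    public static void main(String[] args) {
        String mixed = "abc" + new String(Character.toChars(0x1D11E));
        System.out.println(utf16Units(mixed)); // prints 5, not the 4 code points
    }
}
```

In Java this is of course just String.length(); the extra work only arises on the Plucene side, where strings are sequences of code points rather than UTF-16 units.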

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
