On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true UTF-8 to
> store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> used?
The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
aspect of Java serialization of character streams. Java uses what
they call "a modified version of UTF-8", though that's a really bad
way to describe it. It's a different Unicode encoding, one that
resembles UTF-8, but that's it.
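For illustration (a small standalone demo, not from the original thread), Java's own DataOutputStream.writeUTF shows the encoding in action: U+0000 comes out as the two bytes 0xC0 0x80 rather than the single 0x00 byte that legal UTF-8 requires.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        // Write a string containing only U+0000.
        out.writeUTF("\u0000");
        byte[] bytes = buf.toByteArray();
        // writeUTF emits a two-byte big-endian length prefix (here 0x00 0x02),
        // then 0xC0 0x80 -- the "modified UTF-8" form of U+0000.
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints: 00 02 C0 80
        // Standard UTF-8 would encode U+0000 as the single byte 0x00.
    }
}
```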
It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is not the number of Unicode code points the string
contains. It's the number of Java chars necessary to contain that
string. Code points above the BMP require two Java chars, since they
must be represented by surrogate pairs. The same code point is
represented by a single four-byte sequence in legal UTF-8.
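A quick Java illustration of that mismatch, using U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary supplementary character:

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1D11E lies above the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        // One Unicode code point...
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        // ...but two Java chars (a surrogate pair) -- this is the count
        // Lucene's VInt stores.
        System.out.println(clef.length()); // 2
        // In legal UTF-8 the same code point is one four-byte sequence.
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

So a decoder that counted code points would come up one short for every supplementary character in the string.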
If Plucene counts the number of legal UTF-8 characters and assigns
that number as the VInt at the front of a string, when Java Lucene
decodes the string it will allocate an array of char which is too
small to hold the string.
I think Jian was proposing that Lucene switch to using a true UTF-8
encoding, which would make things a bit cleaner. And probably easier
than changing all references to CESU-8 :)
And yes, given that the integer count is the number of UTF-16 code
units required to represent the string, your code will need to do a
bit more processing when calculating the character count, but that's
a one-liner, right?
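In Java the "one-liner" grows a little, but the idea stays simple: every UTF-8 lead byte contributes one UTF-16 code unit, and four-byte sequences (code points above the BMP) contribute two. A minimal sketch (the method name utf16Units is my own, not anything in Lucene):

```java
import java.nio.charset.StandardCharsets;

public class Utf16UnitCount {
    // Count the UTF-16 code units (Java chars) needed to hold the string
    // encoded by a legal UTF-8 byte sequence, without decoding it.
    static int utf16Units(byte[] utf8) {
        int units = 0;
        for (byte b : utf8) {
            int v = b & 0xFF;
            if ((v & 0xC0) != 0x80) {         // lead byte, not a continuation
                units += (v >= 0xF0) ? 2 : 1; // 4-byte sequence -> surrogate pair
            }
        }
        return units;
    }

    public static void main(String[] args) {
        // "a" + "e-acute" + a supplementary character: 1 + 1 + 2 Java chars.
        String s = "a\u00E9" + new String(Character.toChars(0x1D11E));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf16Units(utf8)); // 4
        System.out.println(s.length());       // 4 -- the count Lucene expects
    }
}
```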
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200