On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true UTF-8 to
> store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> used?
The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
aspect of Java serialization of character streams. Java uses what
they call "a modified version of UTF-8", though that's a really bad
way to describe it. It's a different Unicode encoding, one that
resembles UTF-8, but that's it.
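For illustration (a small standalone demo, not from the original thread), Java's own DataOutputStream.writeUTF shows the encoding in action: U+0000 comes out as the two bytes 0xC0 0x80 rather than the single 0x00 byte that legal UTF-8 requires.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        // Write a string containing only U+0000.
        out.writeUTF("\u0000");
        byte[] bytes = buf.toByteArray();
        // writeUTF emits a two-byte big-endian length prefix (here 0x00 0x02),
        // then 0xC0 0x80 -- the "modified UTF-8" form of U+0000.
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints: 00 02 C0 80
        // Standard UTF-8 would encode U+0000 as the single byte 0x00.
    }
}
```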
It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is not the number of Unicode code points the string
contains. It's the number of Java chars necessary to contain that
string. Code points above the BMP require two Java chars, since they
must be represented by surrogate pairs. The same code point is
represented by a single four-byte sequence in legal UTF-8.
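A quick Java illustration of that mismatch, using U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary supplementary character:

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1D11E lies above the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        // One Unicode code point...
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        // ...but two Java chars (a surrogate pair) -- this is the count
        // Lucene's VInt stores.
        System.out.println(clef.length()); // 2
        // In legal UTF-8 the same code point is one four-byte sequence.
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

So a decoder that counted code points would come up one short for every supplementary character in the string.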
If Plucene counts the number of legal UTF-8 characters and assigns
that number as the VInt at the front of a string, when Java Lucene
decodes the string it will allocate an array of char which is too
small to hold the string.
I think Jian was proposing that Lucene switch to using a true UTF-8
encoding, which would make things a bit cleaner. And probably easier
than changing all references to CESU-8 :)
And yes, given that the integer count is the number of UTF-16 code
units required to represent the string, your code will need to do a
bit more processing when calculating the character count, but that's
a one-liner, right?
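In Java the "one-liner" grows a little, but the idea stays simple: every UTF-8 lead byte contributes one UTF-16 code unit, and four-byte sequences (code points above the BMP) contribute two. A minimal sketch (the method name utf16Units is my own, not anything in Lucene):

```java
import java.nio.charset.StandardCharsets;

public class Utf16UnitCount {
    // Count the UTF-16 code units (Java chars) needed to hold the string
    // encoded by a legal UTF-8 byte sequence, without decoding it.
    static int utf16Units(byte[] utf8) {
        int units = 0;
        for (byte b : utf8) {
            int v = b & 0xFF;
            if ((v & 0xC0) != 0x80) {         // lead byte, not a continuation
                units += (v >= 0xF0) ? 2 : 1; // 4-byte sequence -> surrogate pair
            }
        }
        return units;
    }

    public static void main(String[] args) {
        // "a" + "e-acute" + a supplementary character: 1 + 1 + 2 Java chars.
        String s = "a\u00E9" + new String(Character.toChars(0x1D11E));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf16Units(utf8)); // 4
        System.out.println(s.length());       // 4 -- the count Lucene expects
    }
}
```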
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200