Re: Lucene does NOT use UTF-8.

Marvin Humphrey Sat, 27 Aug 2005 07:08:52 -0700

On Aug 26, 2005, at 10:14 PM, jian chen wrote:

It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used?

It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 Java chars, since they must be represented by surrogate pairs. The same code point must be represented by one character in legal UTF-8.
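To make the difference between the counts concrete, here is a minimal sketch using only standard JDK calls. The class and method names are made up for illustration and are not taken from Lucene; the example string is one BMP character followed by U+1D11E, which lies above the BMP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of Lucene.
class StringLengths {
    // Number of Java chars (UTF-16 code units). Per the explanation above,
    // this is the quantity the VInt prefix records.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in standard (legal) UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), a code point above the BMP.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(javaChars(s));  // 3: one char plus a surrogate pair
        System.out.println(codePoints(s)); // 2
        System.out.println(utf8Bytes(s));  // 5: one byte plus a four-byte sequence
    }
}
```

For strings that stay inside the BMP the first two counts agree, which is presumably why the mismatch is easy to overlook.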

If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string.
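The size of that shortfall can be sketched as follows. This is not Lucene's reader code; it only models the allocation step described above, and the class and method names are hypothetical.

```java
// Hypothetical model of the allocation step only; not Lucene's actual reader.
class PrefixMismatch {
    // Length of the char[] a reader would allocate if the writer had stored
    // a code point count as the prefix (assumption: one slot per counted unit).
    static int allocatedFromCodePointPrefix(String s) {
        int prefix = s.codePointCount(0, s.length());
        return new char[prefix].length;
    }

    // Number of Java chars actually needed to hold the decoded string.
    static int charsNeeded(String s) {
        return s.length();
    }

    // How many chars the buffer falls short by: one per code point above the BMP.
    static int shortfall(String s) {
        return charsNeeded(s) - allocatedFromCodePointPrefix(s);
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // above the BMP
        System.out.println(shortfall("abc"));       // 0
        System.out.println(shortfall("a" + clef));  // 1
    }
}
```

A writer that wants its prefix to match what Java Lucene expects would, under this reading, have to count UTF-16 code units rather than code points.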


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

