Hi, Ken,

Thanks for your email. You are right, I meant to propose that Lucene switch to true UTF-8, rather than working around this issue by fixing the resulting problems elsewhere.
Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote:
>
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8 to
> >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> >>used?
>
> The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
> aspect of Java serialization of character streams. Java uses what
> they call "a modified version of UTF-8", though that's a really bad
> way to describe it. It's a different Unicode encoding, one that
> resembles UTF-8, but that's it.
>
> >It's not a matter of a simple switch. The VInt count at the head of
> >a Lucene string is not the number of Unicode code points the string
> >contains. It's the number of Java chars necessary to contain that
> >string. Code points above the BMP require 2 java chars, since they
> >must be represented by surrogate pairs. The same code point must be
> >represented by one character in legal UTF-8.
> >
> >If Plucene counts the number of legal UTF-8 characters and assigns
> >that number as the VInt at the front of a string, when Java Lucene
> >decodes the string it will allocate an array of char which is too
> >small to hold the string.
>
> I think Jian was proposing that Lucene switch to using a true UTF-8
> encoding, which would make things a bit cleaner. And probably easier
> than changing all references to CEUS-8 :)
>
> And yes, given that the integer count is the number of UTF-16 code
> units required to represent the string, your code will need to do a
> bit more processing when calculating the character count, but that's
> a one-liner, right?
>
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
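As a rough illustration of the counting issue discussed in this thread, here is a minimal Java sketch. It assumes the stored bytes are well-formed standard UTF-8 and derives the number of Java chars (UTF-16 code units), which is what the VInt prefix is described as holding. The class and method names (Utf16UnitCount, utf16UnitsFromUtf8, modifiedUtf8) are made up for the example and are not Lucene or Plucene APIs. The modified UTF-8 comparison uses java.io.DataOutputStream.writeUTF only as a stand-in for the Java-style encoding, not as Lucene's actual writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene or Plucene.
final class Utf16UnitCount {

    // Counts the UTF-16 code units (Java chars) needed to hold text stored
    // as standard UTF-8. Assumes the input is well-formed UTF-8.
    static int utf16UnitsFromUtf8(byte[] utf8) {
        int units = 0;
        for (byte b : utf8) {
            int v = b & 0xFF;
            if ((v & 0xC0) == 0x80) {
                continue; // continuation byte, belongs to an earlier lead byte
            }
            // A 4-byte lead (0xF0 and up) starts a code point above the BMP,
            // which Java represents as a surrogate pair, i.e. 2 chars.
            units += (v >= 0xF0) ? 2 : 1;
        }
        return units;
    }

    // Encodes a string the way DataOutputStream.writeUTF does ("modified
    // UTF-8") and strips the 2-byte length prefix that writeUTF prepends.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) throws IOException {
        // 'a', then U+1D11E (a code point above the BMP), then U+0000.
        String s = new StringBuilder()
                .append('a')
                .appendCodePoint(0x1D11E)
                .appendCodePoint(0)
                .toString();
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println("code points:          " + s.codePointCount(0, s.length())); // 3
        System.out.println("Java chars:           " + s.length());                      // 4
        System.out.println("chars from UTF-8:     " + utf16UnitsFromUtf8(utf8));        // 4
        System.out.println("UTF-8 bytes:          " + utf8.length);                     // 6
        System.out.println("modified UTF-8 bytes: " + modifiedUtf8(s).length);          // 9
    }
}
```

The sample string shows both differences at once: the code point above the BMP counts as two Java chars but one code point, and U+0000 takes one byte in standard UTF-8 but two (0xC0 0x80) in the modified form.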