Re: Lucene does NOT use UTF-8.

Doug Cutting Tue, 30 Aug 2005 14:21:29 -0700

Yonik Seeley wrote:

A related problem exists even if the prefix length vInt is changed to represent the number of Unicode chars (as opposed to number of Java chars), right? The prefix length is no longer the offset into the char[] to put the suffix.


Yes, I suppose this is a problem too.  Sigh.
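
To make the mismatch concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it only assumes the standard String.codePointCount and String.offsetByCodePoints methods. It shows that a prefix length counted in Unicode code points has to be translated before it can be used as an index into the char[]:

```java
class PrefixOffsetSketch {
    // Hypothetical helper: converts a prefix length counted in Unicode
    // code points into the matching index into the UTF-16 char[].
    static int charOffsetForCodePointPrefix(String term, int prefixCodePoints) {
        return term.offsetByCodePoints(0, prefixCodePoints);
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 (stored as a surrogate pair), then 'b'
        String term = "a\uD835\uDD38b";

        System.out.println(term.length());                          // Java chars: 4
        System.out.println(term.codePointCount(0, term.length()));  // code points: 3

        // A shared prefix of 2 code points ends at char index 3, not 2.
        System.out.println(charOffsetForCodePointPrefix(term, 2));  // 3
    }
}
```

The two counts agree for any term without supplementary characters, which is why the problem is easy to miss.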

Another approach might be to convert the target to a UTF-8 byte[] and do all comparisons on byte[]. UTF-8 has some very nice properties, including that the byte[] representations of UTF-8 strings compare the same as UCS-4 would.


I was not aware of that, but I see you are correct:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)

That makes the byte representation much more palatable, since Lucene orders terms lexicographically.
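
The ordering property can be checked with a small sketch in plain Java. Again, this is not Lucene code and the names are hypothetical. It assumes bytes are compared as unsigned values, which matters because Java's byte type is signed. The example picks a pair of strings where String.compareTo (which compares UTF-16 code units) disagrees with code point order, while the UTF-8 byte comparison agrees with it:

```java
import java.nio.charset.StandardCharsets;

class Utf8OrderSketch {
    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Hypothetical helper: compares two strings by their UTF-8 encodings.
    static int compareUtf8(String a, String b) {
        return compareUnsigned(a.getBytes(StandardCharsets.UTF_8),
                               b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";        // U+FF5E, a single Java char
        String supp = "\uD800\uDC00"; // U+10000, stored as a surrogate pair

        // UTF-16 code unit order: 0xFF5E > 0xD800, so U+FF5E sorts AFTER U+10000.
        System.out.println(bmp.compareTo(supp) > 0);    // true

        // UTF-8 byte order: EF BD 9E < F0 90 80 80, matching code point order.
        System.out.println(compareUtf8(bmp, supp) < 0); // true
    }
}
```

So an index sorted by UTF-8 bytes and one sorted by String.compareTo can differ, but only for terms that contain supplementary characters.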


Doug



