Re: Lucene does NOT use UTF-8.

Ken Krugler Tue, 30 Aug 2005 16:18:40 -0700

Yonik Seeley wrote:
A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix.
Yes, I suppose this is a problem too.  Sigh.
Another approach might be to convert the target to a UTF-8 byte[] and do all comparisons on byte[]. UTF-8 has some very nice properties, including that the byte[] representation of UTF-8 strings compare the same as UCS-4 would.
I was not aware of that, but I see you are correct:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)
That makes the byte representation much more palatable, since Lucene orders terms lexicographically.
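
The RFC 3629 property quoted above can be checked with a small sketch. This is an illustration only, not Lucene code: the class and method names are hypothetical, and it assumes the bytes are compared as unsigned values (Java's byte type is signed, so each byte is masked with 0xFF first).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Lucene.
public class Utf8ByteOrder {

    // Lexicographic comparison of byte arrays, treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Comparison by Unicode code point, stepping over surrogate pairs as single units.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Whichever string still has characters left sorts after the other.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        // One sample per UTF-8 encoded length: 1, 2, 3 and 4 bytes.
        String[] terms = {
            "z", "\u00e9", "\uffff", new String(Character.toChars(0x10000))
        };
        for (String a : terms) {
            for (String b : terms) {
                int byBytes = Integer.signum(compareUnsigned(
                        a.getBytes(StandardCharsets.UTF_8),
                        b.getBytes(StandardCharsets.UTF_8)));
                int byCodePoints = Integer.signum(compareCodePoints(a, b));
                System.out.println(byBytes == byCodePoints);
            }
        }
    }
}
```

Comparing the bytes as signed values would break the property, since every byte of a multi-byte UTF-8 sequence has its high bit set.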


Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with "dictionary" order, whereas in the context of UTF-8 it just means "the same order as Unicode code points". And the order of Java chars would be the same as for Unicode code points, other than non-BMP characters.
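
To make the non-BMP exception concrete, here is a minimal sketch (the class and method names are made up for illustration). It relies on the documented behavior of String.compareTo, which compares UTF-16 char values, and on the fact that surrogates occupy U+D800..U+DFFF, below BMP characters such as U+FFFF.

```java
// Hypothetical demo class; not part of Lucene.
public class CodeUnitVsCodePointOrder {

    // Compares only the first code point of each string, which is
    // enough for the single-character samples used here.
    static int compareFirstCodePoint(String a, String b) {
        return Integer.compare(a.codePointAt(0), b.codePointAt(0));
    }

    public static void main(String[] args) {
        String bmp = "\uFFFF";                                          // U+FFFF, one Java char
        String supplementary = new String(Character.toChars(0x10000)); // U+10000, two Java chars

        // Java char order: 0xFFFF is greater than the high surrogate 0xD800,
        // so the BMP string sorts after the supplementary one.
        System.out.println(bmp.compareTo(supplementary) > 0);             // true

        // Code point order: U+FFFF is less than U+10000.
        System.out.println(compareFirstCodePoint(bmp, supplementary) < 0); // true
    }
}
```

So an index sorted by Java chars and one sorted by UTF-8 bytes would disagree only on terms where a supplementary character is compared against a BMP character in the range U+E000..U+FFFF.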


Thanks,

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

