> On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>> "Lucene writes strings as a VInt representing the length of the
>> string in Java chars (UTF-16 code units), followed by the character
>> data."
>
> But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code point) is encoded as either one or two UTF-16 code units.
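
To make that concrete, a quick sketch (U+1D50A is just an arbitrary supplementary character picked for illustration, and codePointCount() is Java 5.0 only):

public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1D50A is outside the BMP, so Java stores it as a surrogate
        // pair: two chars (UTF-16 code units) for one code point.
        String s = "\uD835\uDD0A";
        System.out.println(s.length());                      // 2 (code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point), Java 5.0+
    }
}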

> That doesn't seem to be the case.

The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means UTF-16 code units (well, sort of...see below). Looking at the code, IndexOutput.writeString() calls writeVInt() with the string length.
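
For reference, the layout described above boils down to something like this sketch. It's not the actual IndexOutput code; it assumes the usual seven-bits-per-byte VInt convention (low-order bits first, high bit as a continuation flag) and leaves the character data itself elided:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class StringLayoutSketch {
    // VInt: low-order seven bits per byte, high bit set on every byte
    // except the last (assumed here, not copied from Lucene).
    static void writeVInt(OutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // The length prefix counts Java chars (UTF-16 code units), not bytes
    // and not code points; the character data itself is elided here.
    static void writeString(OutputStream out, String s) throws IOException {
        writeVInt(out, s.length());
        // ... followed by the character data ...
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, "\uD835\uDD0A"); // one code point, but the prefix is 2
        System.out.println(out.size());   // 1 byte: the VInt for 2
    }
}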

One related note: Java 1.4 supports Unicode 3.0, while Java 5.0 supports Unicode 4.0. It was in Unicode 3.1 that supplementary characters (code points above U+FFFF, i.e. outside the BMP) were added, and the UTF-16 encoding was formalized.
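
For what it's worth, mapping a supplementary code point to its surrogate pair is just a bit of arithmetic; a sketch (U+1D50A again as an arbitrary example, and Java 5.0's Character.toChars() does the equivalent):

public class SurrogateSketch {
    public static void main(String[] args) {
        int cp = 0x1D50A;                          // any code point above U+FFFF
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // high (lead) surrogate
        char low  = (char) (0xDC00 + (v & 0x3FF)); // low (trail) surrogate
        // Prints "d835 dd0a", i.e. the pair \uD835 \uDD0A.
        System.out.println(Integer.toHexString(high) + " " + Integer.toHexString(low));
    }
}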

So I think the issue of non-BMP characters is currently a bit esoteric for Lucene, since I'm guessing there are other places in the code (e.g. JDK calls used by Lucene) where non-BMP characters won't be handled properly. That said, some quick tests indicate that 1.4 has some knowledge of surrogate pairs (e.g. converting a String with surrogate pairs to UTF-8 does the right thing).
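
The kind of check I have in mind looks roughly like this (a sketch, again with U+1D50A as an arbitrary non-BMP character; a correct conversion emits a single 4-byte sequence, F0 9D 94 8A, rather than encoding each surrogate separately):

import java.io.UnsupportedEncodingException;

public class Utf8Check {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\uD835\uDD0A";        // one non-BMP code point as a surrogate pair
        byte[] utf8 = s.getBytes("UTF-8");
        System.out.println(utf8.length);  // 4 if surrogates are handled properly
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < utf8.length; i++) {
            hex.append(Integer.toHexString(utf8[i] & 0xFF)).append(' ');
        }
        System.out.println(hex);          // f0 9d 94 8a
    }
}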

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

