> On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>> "Lucene writes strings as a VInt representing the length of the
>> string in Java chars (UTF-16 code units), followed by the character
>> data."
>
> But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code point) is encoded as either one or two UTF-16 code units.
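
To make that concrete, a quick sketch (U+1D50A is just an arbitrary supplementary character picked for illustration, and codePointCount() is Java 5.0 only):

public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1D50A is outside the BMP, so Java stores it as a surrogate
        // pair: two chars (UTF-16 code units) for one code point.
        String s = "\uD835\uDD0A";
        System.out.println(s.length());                      // 2 (code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point), Java 5.0+
    }
}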

> That doesn't seem to be the case.

The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means UTF-16 code units (well, sort of...see below). Looking at the code, IndexOutput.writeString() calls writeVInt() with the string length.
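
For reference, the layout described above boils down to something like this sketch. It's not the actual IndexOutput code; it assumes the usual seven-bits-per-byte VInt convention (low-order bits first, high bit as a continuation flag) and leaves the character data itself elided:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class StringLayoutSketch {
    // VInt: low-order seven bits per byte, high bit set on every byte
    // except the last (assumed here, not copied from Lucene).
    static void writeVInt(OutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // The length prefix counts Java chars (UTF-16 code units), not bytes
    // and not code points; the character data itself is elided here.
    static void writeString(OutputStream out, String s) throws IOException {
        writeVInt(out, s.length());
        // ... followed by the character data ...
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, "\uD835\uDD0A"); // one code point, but the prefix is 2
        System.out.println(out.size());   // 1 byte: the VInt for 2
    }
}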

One related note: Java 1.4 supports Unicode 3.0, while Java 5.0 supports Unicode 4.0. It was in Unicode 3.1 that supplementary characters (code points above U+FFFF, i.e. outside the BMP) were added, and the UTF-16 encoding was formalized.
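
For what it's worth, mapping a supplementary code point to its surrogate pair is just a bit of arithmetic; a sketch (U+1D50A again as an arbitrary example, and Java 5.0's Character.toChars() does the equivalent):

public class SurrogateSketch {
    public static void main(String[] args) {
        int cp = 0x1D50A;                          // any code point above U+FFFF
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // high (lead) surrogate
        char low  = (char) (0xDC00 + (v & 0x3FF)); // low (trail) surrogate
        // Prints "d835 dd0a", i.e. the pair \uD835 \uDD0A.
        System.out.println(Integer.toHexString(high) + " " + Integer.toHexString(low));
    }
}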

So I think the issue of non-BMP characters is currently a bit esoteric for Lucene, since I'm guessing there are other places in the code (e.g. JDK calls used by Lucene) where non-BMP characters won't be handled properly. That said, some quick tests indicate that 1.4 has some knowledge of surrogate pairs (e.g. converting a String with surrogate pairs to UTF-8 does the right thing).
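
The kind of check I have in mind looks roughly like this (a sketch, again with U+1D50A as an arbitrary non-BMP character; a correct conversion emits a single 4-byte sequence, F0 9D 94 8A, rather than encoding each surrogate separately):

import java.io.UnsupportedEncodingException;

public class Utf8Check {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\uD835\uDD0A";        // one non-BMP code point as a surrogate pair
        byte[] utf8 = s.getBytes("UTF-8");
        System.out.println(utf8.length);  // 4 if surrogates are handled properly
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < utf8.length; i++) {
            hex.append(Integer.toHexString(utf8[i] & 0xFF)).append(' ');
        }
        System.out.println(hex);          // f0 9d 94 8a
    }
}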

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

