> On Monday 29 August 2005 19:56, Ken Krugler wrote:
> > "Lucene writes strings as a VInt representing the length of the
> > string in Java chars (UTF-16 code units), followed by the character
> > data."
>
> But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code
point) is encoded as either one or two UTF-16 code units.
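
A concrete example (any supplementary character would do; I'm using
U+1D11E, MUSICAL SYMBOL G CLEF, here):

  public class CodeUnitDemo {
      public static void main(String[] args) {
          // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so Java
          // has to store it as a surrogate pair: two UTF-16 code units.
          String clef = "\uD834\uDD1E";
          System.out.println(clef.length());  // prints 2 (code units),
                                              // but it's one code point
      }
  }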

> That doesn't seem to be the case.

The case where? You mean in what actually gets written out?
String.length() is the length in terms of Java chars, which means
UTF-16 code units (well, sort of...see below). Looking at the code,
IndexOutput.writeString() calls writeVInt() with the string length.
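
From memory, that write path looks roughly like the sketch below
(paraphrased, not the verbatim source, so don't hold me to the exact
code; the point is just that the VInt counts chars, not bytes):

  // Sketch of IndexOutput.writeString(), paraphrased:
  public void writeString(String s) throws IOException {
      int length = s.length();    // Java chars, i.e. UTF-16 code units
      writeVInt(length);          // length prefix
      writeChars(s, 0, length);   // then the character data
  }

  // writeVInt() packs the value 7 bits per byte, with the high bit of
  // each byte flagging that more bytes follow:
  public void writeVInt(int i) throws IOException {
      while ((i & ~0x7F) != 0) {
          writeByte((byte) ((i & 0x7F) | 0x80));
          i >>>= 7;
      }
      writeByte((byte) i);
  }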
One related note. Java 1.4 supports Unicode 3.0, while Java 5.0
supports Unicode 4.0. It was in Unicode 3.1 that supplementary
characters (code points above U+FFFF, i.e. outside the BMP) were added
and the UTF-16 encoding was formalized.
So I think the issue of non-BMP characters is currently a bit
esoteric for Lucene, since I'm guessing there are other places in the
code (e.g. JDK calls used by Lucene) where non-BMP characters won't
be properly handled. That said, some quick tests indicate that 1.4 does
have some knowledge of surrogate pairs (e.g. converting a String with
surrogate pairs to UTF-8 does the right thing).
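
By a quick test I mean something along these lines (not necessarily the
exact snippet I ran, but it shows the idea; the correct UTF-8 for
U+1D11E is the four bytes F0 9D 84 9E):

  import java.io.UnsupportedEncodingException;

  public class SurrogateUtf8Check {
      public static void main(String[] args) throws UnsupportedEncodingException {
          // U+1D11E as a surrogate pair; its correct UTF-8 form is F0 9D 84 9E.
          String s = "\uD834\uDD1E";
          byte[] bytes = s.getBytes("UTF-8");
          StringBuffer hex = new StringBuffer();
          for (int i = 0; i < bytes.length; i++) {
              hex.append(Integer.toHexString(bytes[i] & 0xFF)).append(' ');
          }
          System.out.println(hex);  // f0 9d 84 9e if the pair is handled correctly
      }
  }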
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200