Daniel Naber wrote:
On Monday 29 August 2005 19:56, Ken Krugler wrote:
"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
to be the case.
UTF-16 is a fixed 2 byte/char representation.
I hate to keep beating this horse, but I want to emphasize that it's
2 bytes per Java char (or UTF-16 code unit), not Unicode character
(code point).
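
To make the code unit vs. code point distinction concrete, here is a minimal sketch in Java. The class and method names (CharCounts, codeUnits, codePoints) are made up for illustration and are not part of Lucene; String.length() and String.codePointCount() are standard JDK methods (the latter since Java 5).

```java
// Illustrative only: CharCounts is a hypothetical helper, not Lucene code.
class CharCounts {
    // Number of UTF-16 code units, i.e. what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "caf\u00E9";      // four characters, all in the BMP
        String clef = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));   // 4 4
        System.out.println(codeUnits(clef) + " " + codePoints(clef)); // 2 1
    }
}
```

For text that stays inside the Basic Multilingual Plane the two counts agree; they only diverge once supplementary characters show up.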
But one cannot equate the character count with the byte count. Each
Java char is 2 bytes. I think all that is being said is that the
VInt is equal to str.length() as Java gives it.
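
As a rough sketch of what "a VInt equal to str.length()" could look like on disk: the code below assumes the usual VInt layout (seven payload bits per byte, low-order bits first, high bit set when more bytes follow). VIntSketch, encodeVInt and lengthPrefix are hypothetical names for illustration, not Lucene's own classes, and the character data that follows the prefix is not shown.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a hand-rolled VInt encoder, not Lucene's implementation.
class VIntSketch {
    // Encode a non-negative int, 7 bits per byte, low-order bits first.
    // The high bit of each byte is set when another byte follows.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // The length prefix counts Java chars (UTF-16 code units), not bytes.
    static byte[] lengthPrefix(String s) {
        return encodeVInt(s.length());
    }

    public static void main(String[] args) {
        System.out.println(lengthPrefix("abc").length);  // 1
        System.out.println(encodeVInt(128).length);      // 2
    }
}
```

Under these assumptions a string of up to 127 chars needs a one-byte prefix, regardless of how many bytes the character data itself occupies.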
On an unrelated project we are determining whether we should use a
denormalized (letter followed by an accent) or a normalized form
(letter with accents) of accented characters as we present the text
to a GUI. We have found that font support varies but appears to be
better for denormalized. This is not an issue for storage, as it can
be transformed before it goes to screen. However, it is useful to
know which form it is in.
The reason I mention this is that I seem to remember that the length
of the Java string varies with the representation.
String.length() is the number of Java chars, each of which is always a
UTF-16 code unit. If you normalize text, then yes, that can change the
number of code units and thus the length of the string, but so can doing any
kind of text munging (e.g. replacement) operation on characters in
the string.
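
A small sketch of the normalization effect, assuming the "normalized" and "denormalized" forms discussed above correspond to Unicode NFC (composed) and NFD (decomposed). It uses java.text.Normalizer, which is in the JDK from Java 6 onward; NormalizedLength and lengthIn are hypothetical names used only here.

```java
import java.text.Normalizer;

// Illustrative only: shows String.length() changing with the normalization form.
class NormalizedLength {
    // Length in Java chars after normalizing s to the given form.
    static int lengthIn(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).length();
    }

    public static void main(String[] args) {
        String composed = "\u00E9"; // e with acute accent as a single code point

        System.out.println(lengthIn(composed, Normalizer.Form.NFC)); // 1
        System.out.println(lengthIn(composed, Normalizer.Form.NFD)); // 2
    }
}
```

The same visible letter is one char in composed form and two in decomposed form (base letter plus combining accent), so a stored length only makes sense relative to the form the text was in when it was written.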
So then the count would not be the number of glyphs that the user
sees. Please correct me if I am wrong.
All kinds of m-to-n mappings (both at the layout engine level, and using
font tables) are possible when going from Unicode characters to
display glyphs. Plus zero-width left-kerning glyphs would also alter
the relationship between # of visual "characters" and backing store
characters.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200