Daniel Naber wrote:

On Monday 29 August 2005 19:56, Ken Krugler wrote:

"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case.

UTF-16 is a fixed 2 byte/char representation.

I hate to keep beating this horse, but I want to emphasize that it's 2 bytes per Java char (or UTF-16 code unit), not Unicode character (code point).
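
A minimal sketch of the code unit versus code point distinction, using only standard `java.lang.String` methods. The class and method names are made up for illustration.

```java
final class CodeUnitCount {
    // Number of UTF-16 code units (Java chars); this is what String.length() returns.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "caf\u00E9";           // every character is in the BMP
        String supp = "\uD835\uDD0A";       // U+1D50A, stored as a surrogate pair

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));   // 4 4
        System.out.println(codeUnits(supp) + " " + codePoints(supp)); // 2 1
    }
}
```

For text that stays within the Basic Multilingual Plane the two counts agree, which is why the difference is easy to overlook.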

But one cannot equate the character count with the byte count. Each Java char is 2 bytes in memory. I think all that is being said is that the VInt is equal to str.length() as Java gives it.
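
To make the length prefix concrete, here is a rough sketch of a VInt encoder, assuming the usual variable-length scheme (seven data bits per byte, low-order bits first, high bit set when more bytes follow). `VIntSketch`, `encodeVInt` and `lengthPrefix` are hypothetical names, not Lucene's API, and the encoding of the character data that follows the prefix is deliberately left out.

```java
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    // Encodes a non-negative int as a variable-length byte sequence:
    // 7 data bits per byte, low-order group first, high bit = "more bytes follow".
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // The prefix counts Java chars (str.length()), not bytes and not code points.
    static byte[] lengthPrefix(String s) {
        return encodeVInt(s.length());
    }

    public static void main(String[] args) {
        System.out.println(lengthPrefix("Lucene").length);       // 1 (value 6 fits in one byte)
        System.out.println(encodeVInt(128).length);              // 2
        System.out.println(lengthPrefix("\uD835\uDD0A")[0]);     // 2 (one code point, two chars)
    }
}
```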

On an unrelated project we are determining whether we should use a denormalized form (letter followed by accents) or a normalized form (letter with accents) of accented characters as we present the text to a GUI. We have found that font support varies but appears to be better for denormalized. This is not an issue for storage, as the text can be transformed before it goes to screen. However, it is useful to know which form it is in.

The reason I mention this is that I seem to remember that the length of the Java string varies with the representation.

String.length() is the number of Java chars, which are always UTF-16 code units. If you normalize text, then yes, that can change the number of code units and thus the length of the string, but so can any kind of text munging (e.g. replacement) operation on characters in the string.
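
As an illustration of how normalization changes the length, the sketch below uses `java.text.Normalizer`, which was added to the standard library in Java 6 and so postdates this thread. The helper names are hypothetical.

```java
import java.text.Normalizer;

final class NormalizedLength {
    // Length in Java chars after composing (letter with accent as one code point).
    static int nfcLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    // Length in Java chars after decomposing (letter followed by combining accent).
    static int nfdLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).length();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(nfcLength(eAcute)); // 1
        System.out.println(nfdLength(eAcute)); // 2: 'e' followed by U+0301
    }
}
```

The same visible character therefore gets a different length prefix depending on which form was indexed.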

So then the count would not be the number of glyphs that the user sees. Please correct me if I am wrong.

All kinds of m-to-n mappings (both at the layout engine level, and using font tables) are possible when going from Unicode characters to display glyphs. Plus zero-width left-kerning glyphs would also alter the relationship between the number of visual "characters" and backing store characters.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
