Daniel Naber wrote:

On Monday 29 August 2005 19:56, Ken Krugler wrote:
"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case.

In UTF-16 each code unit (a Java char) is 2 bytes, and a character outside the BMP takes two code units (a surrogate pair), so one cannot equate the character count with the byte count. I think all that is being said is that the VInt is equal to str.length() as Java gives it, i.e. a count of chars, not of bytes.
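To make the distinction concrete, here is a small standalone sketch (not Lucene code; the sample strings and class name are just illustrative) showing that String.length() counts UTF-16 code units, which matches neither the UTF-8 byte count nor the code-point count:

public class VIntLengthDemo {
    public static void main(String[] args) throws Exception {
        String ascii = "abc";                   // 3 chars, 3 UTF-8 bytes
        String accented = "caf\u00E9";          // 4 chars, 5 UTF-8 bytes
        String supplementary = "\uD835\uDD4A";  // U+1D54A: 1 code point, 2 chars (surrogate pair)

        for (String s : new String[] { ascii, accented, supplementary }) {
            System.out.println("length() = " + s.length()
                + ", codePoints = " + s.codePointCount(0, s.length())
                + ", UTF-8 bytes = " + s.getBytes("UTF-8").length);
        }
    }
}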

On an unrelated project we are deciding whether to use a decomposed form (base letter followed by combining accent characters) or a composed form (precomposed letter with accent) of accented characters when we present the text to a GUI. We have found that font support varies but appears to be better for the decomposed form. This is not an issue for storage, since the text can be transformed before it goes to the screen, but it is useful to know which form it is in.

The reason I mention this is that, if I remember correctly, the length of the Java string varies with the representation: the composed and decomposed forms of the same text have different char counts. So the stored count would not be the number of glyphs the user sees. Please correct me if I am wrong.
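A minimal sketch of that effect, assuming java.text.Normalizer (added in Java 6; at the time of this thread you would use ICU4J's normalizer instead): the same on-screen "é" is one char composed and two chars decomposed, so str.length(), and hence the stored count, differs between representations even though the glyph count is the same.

import java.text.Normalizer;

public class NormalizedLengthDemo {
    public static void main(String[] args) {
        String composed   = "\u00E9";   // 'é' as one precomposed code point (NFC)
        String decomposed = "e\u0301";  // 'e' + combining acute accent (NFD)

        // Same glyph on screen, different char counts:
        System.out.println(composed.length());    // 1
        System.out.println(decomposed.length());  // 2

        // Converting to the other form changes str.length(), so the count
        // written with the string depends on which form was indexed.
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).length());   // 2
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).length()); // 1
    }
}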
