Daniel Naber wrote:

On Monday 29 August 2005 19:56, Ken Krugler wrote:
"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case.

In UTF-16 each code unit (a Java char) is 2 bytes, and a character outside the BMP takes two code units (a surrogate pair), so one cannot equate the character count with the byte count. I think all that is being said is that the VInt is equal to str.length() as Java gives it, i.e. a count of chars, not of bytes.
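To make the distinction concrete, here is a small standalone sketch (not Lucene code; the sample strings and class name are just illustrative) showing that String.length() counts UTF-16 code units, which matches neither the UTF-8 byte count nor the code-point count:

public class VIntLengthDemo {
    public static void main(String[] args) throws Exception {
        String ascii = "abc";                   // 3 chars, 3 UTF-8 bytes
        String accented = "caf\u00E9";          // 4 chars, 5 UTF-8 bytes
        String supplementary = "\uD835\uDD4A";  // U+1D54A: 1 code point, 2 chars (surrogate pair)

        for (String s : new String[] { ascii, accented, supplementary }) {
            System.out.println("length() = " + s.length()
                + ", codePoints = " + s.codePointCount(0, s.length())
                + ", UTF-8 bytes = " + s.getBytes("UTF-8").length);
        }
    }
}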

On an unrelated project we are deciding whether to use a decomposed form (base letter followed by combining accent characters) or a composed form (precomposed letter with accent) of accented characters when we present the text to a GUI. We have found that font support varies but appears to be better for the decomposed form. This is not an issue for storage, since the text can be transformed before it goes to the screen, but it is useful to know which form it is in.

The reason I mention this is that, if I remember correctly, the length of the Java string varies with the representation: the composed and decomposed forms of the same text have different char counts. So the stored count would not be the number of glyphs the user sees. Please correct me if I am wrong.
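A minimal sketch of that effect, assuming java.text.Normalizer (added in Java 6; at the time of this thread you would use ICU4J's normalizer instead): the same on-screen "é" is one char composed and two chars decomposed, so str.length(), and hence the stored count, differs between representations even though the glyph count is the same.

import java.text.Normalizer;

public class NormalizedLengthDemo {
    public static void main(String[] args) {
        String composed   = "\u00E9";   // 'é' as one precomposed code point (NFC)
        String decomposed = "e\u0301";  // 'e' + combining acute accent (NFD)

        // Same glyph on screen, different char counts:
        System.out.println(composed.length());    // 1
        System.out.println(decomposed.length());  // 2

        // Converting to the other form changes str.length(), so the count
        // written with the string depends on which form was indexed.
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).length());   // 2
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).length()); // 1
    }
}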
