On 8/30/05, Ken Krugler <[EMAIL PROTECTED]> wrote:
>
> >Daniel Naber wrote:
> >
> >>On Monday 29 August 2005 19:56, Ken Krugler wrote:
> >>
> >>>"Lucene writes strings as a VInt representing the length of the
> >>>string in Java chars (UTF-16 code units), followed by the character
> >>>data."
> >>>
> >>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
> >>to be the case.
> >>
> >UTF-16 is a fixed 2 byte/char representation.
>
> I hate to keep beating this horse, but I want to emphasize that it's
> 2 bytes per Java char (or UTF-16 code unit), not Unicode character
> (code point).

There's more horse beating on Java and Unicode 4 in this blog entry:
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html
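
To make the code unit vs. code point distinction concrete, here is a minimal sketch (not taken from Lucene; the class and method names are made up for illustration). It assumes Java 7 or later for StandardCharsets and compares a BMP character with a supplementary one:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsVsCodePoints {
    // Number of Java chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "A";
        String supplementary = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF
        for (String s : new String[] { bmp, supplementary }) {
            System.out.println(codeUnits(s) + " code unit(s), "
                    + codePoints(s) + " code point(s), "
                    + utf16Bytes(s) + " byte(s) in UTF-16");
        }
    }
}
```

For "A" all three views agree on a single 2-byte unit. The G clef is one code point but takes two Java chars (a surrogate pair), so 4 bytes. That is why "2 bytes per char" holds for code units and not for Unicode characters.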