I think the VInt should be the number of bytes to be stored using the UTF-8 encoding.
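To make the byte-count proposal concrete, here is a minimal sketch of writing and reading a string whose VInt prefix is the UTF-8 byte length. The class and method names (Utf8LengthPrefix, writeVInt, writeString, readString) are invented for illustration and are not Lucene's API; the VInt layout assumed is the usual one of seven data bits per byte, low-order bits first, with the high bit marking continuation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not Lucene's IndexOutput/IndexInput.
public class Utf8LengthPrefix {

    // Write a non-negative int as a VInt: 7 data bits per byte, low-order
    // bits first, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode the string as UTF-8 and prefix it with its length in BYTES.
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Read the VInt byte count, then decode exactly that many bytes.
    static String readString(byte[] buf) {
        int pos = 0;
        int length = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos++];
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new String(buf, pos, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "na\u00efve \u20ac"; // 7 chars, more than 7 UTF-8 bytes
        byte[] encoded = writeString(s);
        System.out.println("chars=" + s.length() + " bytes=" + (encoded.length - 1));
        System.out.println("round trip ok: " + readString(encoded).equals(s));
    }
}
```

With a byte count, a reader knows how much to read (or skip) before decoding anything. With a character count it has to decode as it goes to find the end of the string.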
It is trivial to use the String methods identified before to do the conversion. The String(char[]) constructor allocates a new char array. For performance, you can use the actual Charset encoding classes, avoiding all of the lookups performed by the String class.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.

Ken Krugler wrote:
> The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations.

> I'm going to take this off-list now [ ... ]

Please don't. It's better to have a record of the discussion.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]