RE: Lucene does NOT use UTF-8.

Ken Krugler Tue, 30 Aug 2005 10:28:39 -0700

I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.


It is trivial to use the String methods identified before to do the
conversion. The String(char[]) constructor allocates a new char array.

For performance, you can use the actual Charset encoding classes, avoiding
all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write out
the VInt value as UTF-8 bytes versus Java chars, the Java String has to
either be converted to UTF-8 in memory first, or pre-scanned. The first is
a memory hit, and the second is a performance hit. I don't know the extent
of either, but it's there.
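
To make the two options concrete, here is a minimal sketch of both ways of
getting the UTF-8 byte count for a String. The class and method names are
made up for illustration and are not Lucene code. It assumes standard UTF-8
(not Java's modified UTF-8), well-formed UTF-16 input, and a current JDK
(StandardCharsets is newer than this thread).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
class Utf8LengthSketch {

    // Option 1: convert in memory first, then measure.
    // Costs a temporary byte[] for every string (the "memory hit").
    static int lengthByEncoding(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Option 2: pre-scan the chars and add up the encoded sizes.
    // No allocation, but an extra pass over the string (the "performance hit").
    // Assumes surrogates are correctly paired.
    static int lengthByPreScan(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair = one 4-byte sequence
                i++;                              // skip the low surrogate
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        System.out.println(lengthByEncoding(s) + " " + lengthByPreScan(s));
    }
}
```

Either result could then be written as the VInt prefix; the sketch says
nothing about how large the cost of each approach actually is.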

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.
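
As an illustration of why the space for the length can't be reserved up
front, here is a sketch of the 7-bits-per-byte scheme described in Lucene's
file format documentation (low-order bits first, high bit set when more
bytes follow). The class and method names are illustrative only.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not the Lucene implementation.
class VIntSketch {

    // Writes value as 1 to 5 bytes: 7 payload bits per byte,
    // with the high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Number of bytes writeVInt would emit for this value.
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    public static void main(String[] args) {
        // The width of the prefix depends on the value being stored.
        System.out.println(vIntSize(127) + " " + vIntSize(128));
    }
}
```

Because a length of 127 takes one byte and a length of 128 takes two, the
writer has to know the final value before it emits the first string byte.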


-- Ken

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:

 The remaining issue is dealing with old-format indexes.


I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.
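
A rough sketch of how a reader might interpret that leading value. The
constant and method names are hypothetical, and treating zero the same as
positive values is an assumption; only the numbers -1, -2 and "positive
means version 0" come from the paragraph above.

```java
// Hypothetical helper, not part of Lucene.
class SegmentsFormatSketch {
    static final int FORMAT_LEGACY = 0;   // pre-versioned files
    static final int FORMAT_CURRENT = -1; // current format
    static final int FORMAT_UTF8 = -2;    // proposed format with UTF-8 strings

    // Maps the first int of the segments file to a format version.
    static int formatOf(int firstInt) {
        if (firstInt >= 0) {
            return FORMAT_LEGACY;         // non-negative values count as version 0
        }
        if (firstInt < FORMAT_UTF8) {
            throw new IllegalArgumentException("Unknown index format: " + firstInt);
        }
        return firstInt;
    }

    // Back-compatible readers can pass this flag down to the string readers.
    static boolean stringsAreUtf8(int firstInt) {
        return formatOf(firstInt) == FORMAT_UTF8;
    }

    public static void main(String[] args) {
        System.out.println(stringsAreUtf8(-2) + " " + stringsAreUtf8(-1));
    }
}
```

An implementation that does not want to be back-compatible could instead
throw for anything other than FORMAT_UTF8.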

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.
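
To illustrate the allocation argument, here is a sketch of a reader that is
given the character count (in Java chars, i.e. UTF-16 code units) and can
therefore size its char[] exactly once before decoding. The names are
hypothetical; it assumes well-formed standard UTF-8 and does no validation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
class CharCountReaderSketch {

    // Decodes exactly charCount Java chars of UTF-8 starting at offset.
    static String readChars(byte[] buf, int offset, int charCount) {
        char[] out = new char[charCount];  // exact size is known up front
        int i = offset;
        int n = 0;
        while (n < charCount) {
            int b = buf[i++] & 0xFF;
            if (b < 0x80) {                        // 1-byte sequence
                out[n++] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {       // 2-byte sequence
                out[n++] = (char) (((b & 0x1F) << 6) | (buf[i++] & 0x3F));
            } else if ((b & 0xF0) == 0xE0) {       // 3-byte sequence
                out[n++] = (char) (((b & 0x0F) << 12)
                        | ((buf[i++] & 0x3F) << 6)
                        | (buf[i++] & 0x3F));
            } else {                               // 4-byte sequence: two chars
                int cp = ((b & 0x07) << 18)
                        | ((buf[i++] & 0x3F) << 12)
                        | ((buf[i++] & 0x3F) << 6)
                        | (buf[i++] & 0x3F);
                n += Character.toChars(cp, out, n);
            }
        }
        return new String(out, 0, charCount);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(readChars(utf8, 0, s.length()).equals(s));
    }
}
```

With a byte count instead, the reader would not know the char[] size until
it had decoded the bytes, so it would have to over-allocate or scan first.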

 I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

Doug



--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
