Re: Lucene does NOT use UTF-8.

Ken Krugler Tue, 30 Aug 2005 09:59:50 -0700

Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.
I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes.
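
A minimal sketch of that version check, for illustration only. The class name, constant names, and the big-endian 32-bit read via `DataInputStream` are assumptions, not Lucene's actual code; only the values -1, -2, and "positive means version 0" come from the message above. Treating a leading zero like a positive value is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper, not part of Lucene.
class SegmentsFormat {
    // -1 is the current format per the discussion; -2 is the proposed one.
    static final int FORMAT_MODIFIED_UTF8 = -1;
    static final int FORMAT_STANDARD_UTF8 = -2;

    /** Reads the first int of the segments file and maps it to a format version. */
    static int readFormat(DataInputStream in) throws IOException {
        int first = in.readInt();
        if (first >= 0) {
            // Pre-versioning files start with a non-negative value: treat as version 0.
            return 0;
        }
        if (first < FORMAT_STANDARD_UTF8) {
            // Written by a newer implementation than this reader knows about.
            throw new IOException("Unknown segments format: " + first);
        }
        return first;
    }

    /** True if strings in this index may use Java modified UTF-8. */
    static boolean usesModifiedUtf8(int format) {
        return format == 0 || format == FORMAT_MODIFIED_UTF8;
    }
}
```

An implementation that does not want to be back-compatible could throw as soon as `usesModifiedUtf8(format)` returns true.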

After looking at it a bit more, I think there's no problem w/having the new code read both UTF-8 and Java modified UTF-8, and always write correct UTF-8. So the only compatibility issue would be new Lucene indexes w/non-BMP characters being processed by older versions of Lucene (or ports that weren't updated).
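
To illustrate why one reader can handle both encodings, here is a rough sketch; the class `LenientUtf8` and its methods are hypothetical and are not Lucene code. Modified UTF-8 differs from standard UTF-8 in two ways: U+0000 is written as the two bytes C0 80, and each half of a surrogate pair is written as its own 3-byte sequence. A decoder that emits UTF-16 chars accepts both forms, because the two 3-byte surrogates simply land next to each other in the Java string. The sketch does not validate continuation bytes or guard against truncated input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
class LenientUtf8 {
    /** Decodes standard UTF-8 or Java modified UTF-8 into a Java String. */
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            if (c < 0x80) {                        // 1 byte: ASCII
                sb.append((char) c);
                i += 1;
            } else if ((c & 0xE0) == 0xC0) {       // 2 bytes (C0 80 decodes to U+0000)
                sb.append((char) (((c & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else if ((c & 0xF0) == 0xE0) {       // 3 bytes: BMP char or one surrogate half
                sb.append((char) (((c & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            } else if ((c & 0xF8) == 0xF0) {       // 4 bytes: non-BMP code point
                int cp = ((c & 0x07) << 18)
                        | ((b[i + 1] & 0x3F) << 12)
                        | ((b[i + 2] & 0x3F) << 6)
                        | (b[i + 3] & 0x3F);
                sb.appendCodePoint(cp);            // stored as a surrogate pair
                i += 4;
            } else {
                throw new IllegalArgumentException("Invalid lead byte at offset " + i);
            }
        }
        return sb.toString();
    }

    /** Always writes standard UTF-8 (4 bytes for non-BMP characters). */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

An older reader that only understands modified UTF-8 would not know what to do with the 4-byte form, which is the remaining compatibility issue described above.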

I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations.

Agreed, though just to clarify, it's the number of UTF-16 code units (Java chars), not the number of Unicode code points (Unicode characters).
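
A small sketch of the three candidate lengths, using only standard `java.lang.String` methods; the class name `StringLengths` is made up for illustration. For a string containing a non-BMP character the three counts all differ, and only the code-unit count tells a Java reader exactly how large a `char[]` to allocate up front.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
class StringLengths {
    /** UTF-16 code units (Java chars): the length prefix discussed above. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Unicode code points: a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Bytes in the standard UTF-8 encoding. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), a non-BMP character.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(s));  // 3: 'a' plus two surrogates
        System.out.println(codePoints(s)); // 2
        System.out.println(utf8Bytes(s));  // 5: 1 byte plus 4 bytes
    }
}
```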

I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

No problem. I was worried that the discussion Marvin & I were having was turning into a two person IM chat via email.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

