Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes.
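
A minimal sketch of the version check described above, assuming the leading value of the segments file is read as a 32-bit int. The class, constant, and method names are hypothetical, not Lucene's actual API, and treating zero the same as positive values is an assumption.

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical helper; names are illustrative and not part of Lucene.
final class SegmentsFormatSketch {
    static final int FORMAT_LEGACY = 0;         // any non-negative leading int (assumption: zero included)
    static final int FORMAT_MODIFIED_UTF8 = -1; // current format, Java modified UTF-8 strings
    static final int FORMAT_STANDARD_UTF8 = -2; // proposed format, standard UTF-8 strings

    // Maps the first int of the segments file to a format version.
    static int formatFromLeadingInt(int leading) {
        if (leading >= 0) {
            return FORMAT_LEGACY;
        }
        if (leading < FORMAT_STANDARD_UTF8) {
            // Written by a newer implementation than this reader understands.
            throw new IllegalArgumentException("Unknown index format: " + leading);
        }
        return leading;
    }

    // Reads the format before any other file is touched.
    static int readFormat(DataInput in) throws IOException {
        return formatFromLeadingInt(in.readInt());
    }

    // Callers can pass the format around and branch on it when decoding strings.
    static boolean usesStandardUtf8(int format) {
        return format == FORMAT_STANDARD_UTF8;
    }
}
```

An implementation that does not want to be back-compatible could instead throw for every value other than -2.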

After looking at it a bit more, I think there's no problem with having the new code read both UTF-8 and Java modified UTF-8, and always write correct UTF-8. So the only compatibility issue would be new Lucene indexes with non-BMP characters being processed by older versions of Lucene (or ports that weren't updated).
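
To illustrate why one reader can handle both encodings: modified UTF-8 differs from standard UTF-8 in writing U+0000 as the two bytes C0 80 and in writing each non-BMP character as two 3-byte surrogate sequences rather than one 4-byte sequence. A decoder that accepts 1- to 4-byte sequences and emits UTF-16 code units covers both. The sketch below is an assumption about how that could look, not Lucene's code; the class and method names are hypothetical, and it does not validate continuation bytes or guard against truncated input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative and not part of Lucene.
final class LenientUtf8Sketch {
    // Decodes bytes that are either standard UTF-8 or Java modified UTF-8.
    static String decodeLenient(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) {
                // Single byte: ASCII.
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {
                // Two bytes; also covers the modified UTF-8 form of U+0000 (C0 80).
                sb.append((char) (((b0 & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                // Three bytes; in modified UTF-8 this may be one half of a surrogate pair.
                sb.append((char) (((b0 & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            } else if ((b0 & 0xF8) == 0xF0) {
                // Four bytes: standard UTF-8 for a non-BMP code point.
                int cp = ((b0 & 0x07) << 18)
                        | ((b[i + 1] & 0x3F) << 12)
                        | ((b[i + 2] & 0x3F) << 6)
                        | (b[i + 3] & 0x3F);
                sb.appendCodePoint(cp); // appends a surrogate pair
                i += 4;
            } else {
                throw new IllegalArgumentException("Invalid lead byte at offset " + i);
            }
        }
        return sb.toString();
    }

    // Always writes standard UTF-8 (assumes the string has no unpaired surrogates).
    static byte[] encodeStandard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With this approach, U+1D11E decodes to the same Java string whether it arrives as F0 9D 84 9E or as ED A0 B4 ED B4 9E, and is always written back as the 4-byte form.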

I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations.

Agreed, though just to clarify, it's the number of UTF-16 code units (Java chars), not the number of Unicode code points (Unicode characters).
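
The three candidate length measures differ as soon as a string contains non-ASCII or non-BMP characters. The sketch below uses only standard `java.lang.String` methods; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative and not part of Lucene.
final class LengthUnitsSketch {
    // Number of Java chars (UTF-16 code units); the value proposed for the length prefix.
    static int utf16CodeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a non-BMP character counts once here.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For the string "a" followed by U+1D11E, these give 3 code units, 2 code points, and 5 bytes. A reader that is handed the code-unit count can allocate a `char[3]` up front and fill it while decoding, which is the allocation saving mentioned above.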

I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

No problem. I was worried that the discussion Marvin & I were having was turning into a two-person IM chat via email.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
