Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.
I think that revving the version number on the segments file would
be a good start. This file must be read before any others. Its
current version is -1 and would become -2. (All positive values are
version 0, for back-compatibility.) Implementations can be modified
to pass the version around if they wish to be back-compatible, or
they can simply throw exceptions for old format indexes.
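
A minimal Java sketch of the version check described above. The class, constant, and method names are hypothetical and are not Lucene's actual code. The only facts taken from this thread are that the current value is -1, the proposed value is -2, and positive values mean version 0. Treating 0 itself as version 0 is an assumption.

```java
final class SegmentsFormatSketch {
    // Hypothetical names; the values follow the thread.
    static final int FORMAT_LEGACY = 0;    // any positive leading int
    static final int FORMAT_CURRENT = -1;  // current segments format
    static final int FORMAT_PROPOSED = -2; // proposed format with standard UTF-8

    // Maps the first int of the segments file to a format version.
    static int classify(int firstInt) {
        if (firstInt >= 0) {
            // Assumption: zero is treated like the positive values.
            return FORMAT_LEGACY;
        }
        if (firstInt < FORMAT_PROPOSED) {
            throw new IllegalArgumentException("Unknown segments format: " + firstInt);
        }
        return firstInt;
    }

    // Back-compatible implementations can pass this flag to their string readers.
    static boolean usesStandardUtf8(int format) {
        return format == FORMAT_PROPOSED;
    }

    // Implementations that do not want to be back-compatible can simply reject old indexes.
    static void requireProposedFormat(int format) {
        if (format != FORMAT_PROPOSED) {
            throw new IllegalStateException("Old-format index, version " + format);
        }
    }
}
```

A back-compatible port would call classify() once and hand the result to its readers. A port that is not back-compatible would call requireProposedFormat() right after reading the segments file.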
After looking at it a bit more, I think there's no problem with having
the new code read both UTF-8 and Java modified UTF-8, and always
write correct UTF-8. So the only compatibility issue would be new
Lucene indexes with non-BMP characters being processed by older versions
of Lucene (or ports that weren't updated).
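
To illustrate why one reader can handle both encodings, here is a rough Java sketch. It is not Lucene's code, and the class and method names are made up. Modified UTF-8 differs from standard UTF-8 in two ways: NUL is written as the two bytes C0 80, and each non-BMP character is written as two three-byte sequences (one per surrogate) instead of one four-byte sequence. A decoder that builds UTF-16 chars accepts all of these forms. The sketch does no validation of continuation bytes and assumes the input is not truncated.

```java
import java.nio.charset.StandardCharsets;

final class LenientUtf8Sketch {
    // Decodes bytes that may be standard UTF-8 or Java modified UTF-8.
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {
                // Two bytes; also covers the modified UTF-8 form of NUL (C0 80).
                sb.append((char) (((b0 & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                // Three bytes; in modified UTF-8 this may be half of a surrogate pair.
                sb.append((char) (((b0 & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            } else if ((b0 & 0xF8) == 0xF0) {
                // Four bytes; standard UTF-8 for a non-BMP code point.
                int cp = ((b0 & 0x07) << 18)
                        | ((b[i + 1] & 0x3F) << 12)
                        | ((b[i + 2] & 0x3F) << 6)
                        | (b[i + 3] & 0x3F);
                sb.appendCodePoint(cp); // appends a surrogate pair
                i += 4;
            } else {
                throw new IllegalArgumentException("Bad lead byte at offset " + i);
            }
        }
        return sb.toString();
    }

    // Always writes standard UTF-8 (four bytes for a non-BMP character).
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With U+1D11E as the test character, the modified form ED A0 B4 ED B4 9E and the standard form F0 9D 84 9E should both decode to the same two-char Java string, and encode() should produce only the standard form.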
I would argue that the length written be the number of characters in
the string, rather than the number of bytes written, since that can
minimize string memory allocations.
Agreed, though just to clarify, it's the number of UTF-16 code units
(Java chars), not the number of Unicode code points (Unicode
characters).
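
A small Java sketch of the three possible "lengths" for the same string, to make the distinction concrete. The class and method names are illustrative only. The point about allocations is that a reader given the UTF-16 code unit count can size its char buffer once, before decoding any bytes.

```java
import java.nio.charset.StandardCharsets;

final class StringLengthSketch {
    // UTF-16 code units (Java chars): the value proposed as the length prefix.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points: differs from codeUnits() for non-BMP characters.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes in standard UTF-8: what a byte-count prefix would hold.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // With a code unit prefix, the reader's buffer is exactly the right size up front.
    static char[] allocateFor(int lengthPrefix) {
        return new char[lengthPrefix];
    }
}
```

For a string made of "a" followed by U+1D11E, codeUnits() is 3, codePoints() is 2, and utf8Bytes() is 5.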
I'm going to take this off-list now [ ... ]
Please don't. It's better to have a record of the discussion.
No problem. I was worried that the discussion Marvin & I were having
was turning into a two person IM chat via email.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200