On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar enough with UTF-8 to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue.... anyone raising a hand?
I could, but recent posts make me think this is heading towards a
religious debate :)
Ken - you mentioned taking the discussion off-line in a previous
post. Please don't. Let's keep it alive on java-dev until we have
a resolution to it.
I think the following statements are all true:
a. Using UTF-8 for strings would make it easier for Lucene indexes
to be used by other implementations besides the reference Java
version.
b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.
What, if any, performance impact would changing Java Lucene in this
regard have? (I realize this is rhetorical at this point, until a
solution is at hand)
Almost zero. A tiny hit when reading/writing surrogate pairs, to
properly encode each of them as a single 4-byte UTF-8 sequence instead
of two 3-byte sequences.
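To make the difference concrete, here's a small standalone sketch (not Lucene code) comparing the two encodings of one character outside the BMP. It uses DataOutputStream.writeUTF, which produces the "Java modified UTF-8" form under discussion; U+1D11E is just an arbitrary example character.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SurrogateEncodingDemo {
    public static void main(String[] args) throws IOException {
        // U+1D11E (musical G clef) is outside the BMP, so in Java it is
        // a surrogate pair: two UTF-16 chars.
        String s = new String(Character.toChars(0x1D11E));
        System.out.println("UTF-16 code units: " + s.length()); // 2

        // Standard UTF-8: one 4-byte sequence for the whole code point.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("standard UTF-8 bytes: " + utf8.length); // 4

        // Java modified UTF-8 (DataOutputStream.writeUTF): each surrogate
        // is encoded separately as a 3-byte sequence.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        // writeUTF prepends a 2-byte length, so subtract it.
        System.out.println("modified UTF-8 bytes: " + (bos.size() - 2)); // 6
    }
}
```

So the "tiny hit" is just detecting a high/low surrogate pair while encoding and emitting the 4-byte form instead of the two 3-byte forms.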
c. The hard(er) part would be backwards compatibility with older
indexes. I haven't looked at this enough to really know, but one
example is the compound file (xx.cfs) format...I didn't see a
version number, and it contains strings.
I don't know the gory details, but we've made compatibility breaking
changes in the past and the current version of Lucene can open older
formats, but only write the most current format. I suspect it could
be made to be backwards compatible. Worst case, we break
compatibility in 2.0.
Ronald is correct in that it would be easy to make the reader handle
both "Java modified UTF-8" and UTF-8, and the writer always output
UTF-8. So the only problem would be if older versions of Lucene (or
maybe CLucene) wound up trying to read strings that contained 4-byte
UTF-8 sequences, as they wouldn't know how to convert such a sequence
into two UTF-16 Java chars.
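The missing step in an old reader is mechanical. Here's a sketch (illustrative, not Lucene's actual reader) of decoding a 4-byte UTF-8 sequence to a code point and then splitting it into the surrogate pair Java needs:

```java
public class FourByteDecodeDemo {
    // Decode one 4-byte UTF-8 sequence into a surrogate pair.
    // (No validation; a real reader would check the byte patterns.)
    static char[] decodeFourByte(byte[] b) {
        int codePoint = ((b[0] & 0x07) << 18)
                      | ((b[1] & 0x3F) << 12)
                      | ((b[2] & 0x3F) << 6)
                      |  (b[3] & 0x3F);
        // Character.toChars splits a supplementary code point into
        // its high and low surrogates.
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // Standard UTF-8 encoding of U+1D11E.
        byte[] clef = { (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E };
        char[] pair = decodeFourByte(clef);
        System.out.printf("U+%X -> %04X %04X%n",
            Character.toCodePoint(pair[0], pair[1]),
            (int) pair[0], (int) pair[1]); // U+1D11E -> D834 DD1E
    }
}
```

A reader patched this way handles both encodings, since the 1-, 2-, and 3-byte cases are identical in the two schemes; only the lead-byte pattern 11110xxx needs the new branch.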
Since 4-byte UTF-8 sequences are only for characters outside of the
BMP, and these are rare, it seems like an OK thing to do, but that's
just my uninformed view.
d. The documentation could be clearer on what is meant by the
"string length", but this is a trivial change.
That change was made by Daniel soon after this discussion began.
Daniel changed the definition of Chars, but the String section still
needs to be clarified. Currently it says:
"Lucene writes strings as a VInt representing the length, followed by
the character data".
It should read:
"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
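The distinction matters precisely for the non-BMP case above. A quick illustration (again using U+1D11E as an example character) of why "length in Java chars" is the right wording, since code units, code points, and UTF-8 bytes all disagree:

```java
public class StringLengthDemo {
    public static void main(String[] args) {
        // "a" followed by U+1D11E (a surrogate pair in UTF-16).
        String s = "a" + new String(Character.toChars(0x1D11E));

        // The VInt Lucene writes for this string would be 3:
        System.out.println(s.length());                       // 3 UTF-16 code units
        // ...not 2:
        System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        // ...and not 5:
        System.out.println(s.getBytes(
            java.nio.charset.StandardCharsets.UTF_8).length); // 5 UTF-8 bytes
    }
}
```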
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]