On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar enough with UTF-8 to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?

I could, but recent posts make me think this is heading towards a religious debate :)

Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it.

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand)

Almost zero. A tiny hit when reading/writing surrogate pairs, to properly encode them as a single 4-byte UTF-8 sequence versus two 3-byte sequences.
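
To make the size difference concrete, here is a small standalone example (my own, not Lucene code) using U+1D11E, a character outside the BMP. Strict UTF-8 encodes it as one 4-byte sequence, while DataOutputStream.writeUTF's "modified UTF-8" encodes each half of its surrogate pair as a separate 3-byte sequence:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class SurrogateEncodingDemo {
    public static void main(String[] args) throws Exception {
        // U+1D11E (MUSICAL SYMBOL G CLEF), stored in Java as a surrogate pair.
        String s = "\uD834\uDD1E";

        // Strict UTF-8: one 4-byte sequence.
        byte[] utf8 = s.getBytes("UTF-8");
        System.out.println("strict UTF-8:   " + utf8.length + " bytes");        // 4

        // "Java modified UTF-8" (DataOutput.writeUTF): each surrogate becomes
        // its own 3-byte sequence, plus a 2-byte length prefix we subtract here.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        System.out.println("modified UTF-8: " + (bytes.size() - 2) + " bytes");  // 6
    }
}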

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format... I didn't see a version number, and it contains strings.

I don't know the gory details, but we've made compatibility-breaking changes in the past; the current version of Lucene can open older formats but only writes the most current format. I suspect it could be made backwards compatible. Worst case, we break compatibility in 2.0.

Ronald is correct that it would be easy to make the reader handle both "Java modified UTF-8" and UTF-8, and the writer always output UTF-8. So the only problem would be if older versions of Lucene (or maybe CLucene) wound up trying to read strings that contained 4-byte UTF-8 sequences, as they wouldn't know how to convert such a sequence into two UTF-16 Java chars.

Since 4-byte UTF-8 sequences are only for characters outside of the BMP, and these are rare, it seems like an OK thing to do, but that's just my uninformed view.
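
For illustration, here is a rough sketch of what a tolerant readChars-style loop could look like (hypothetical code, not the actual IndexInput implementation): it decodes 1-, 2-, and 3-byte sequences as before, so the old surrogate-pair encoding still works, and additionally expands a 4-byte sequence into the two UTF-16 Java chars of a surrogate pair:

static void readChars(java.io.DataInput in, char[] buffer, int start, int length)
        throws java.io.IOException {
    for (int i = start; i < start + length; i++) {
        int b = in.readByte() & 0xFF;
        if ((b & 0x80) == 0) {                        // 1 byte: U+0000..U+007F
            buffer[i] = (char) b;
        } else if ((b & 0xE0) == 0xC0) {              // 2 bytes
            buffer[i] = (char) (((b & 0x1F) << 6) | (in.readByte() & 0x3F));
        } else if ((b & 0xF0) == 0xE0) {              // 3 bytes; also covers the old
            buffer[i] = (char) (((b & 0x0F) << 12)    // two-3-byte surrogate encoding
                    | ((in.readByte() & 0x3F) << 6)
                    | (in.readByte() & 0x3F));
        } else {                                      // 4 bytes: outside the BMP,
            int cp = ((b & 0x07) << 18)               // becomes two Java chars
                    | ((in.readByte() & 0x3F) << 12)
                    | ((in.readByte() & 0x3F) << 6)
                    | (in.readByte() & 0x3F);
            buffer[i++] = (char) (0xD800 + ((cp - 0x10000) >> 10));   // high surrogate
            buffer[i]   = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF)); // low surrogate
            // (bounds checking for the second char is elided in this sketch)
        }
    }
}

An old reader hitting the 4-byte branch is exactly the failure case above: it has no rule for producing two chars from one sequence.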

d. The documentation could be clearer on what is meant by the "string length", but this is a trivial change.

That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still needs to be clarified. Currently it says:

"Lucene writes strings as a VInt representing the length, followed by the character data".

It should read:

"Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data."

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
