On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar enough with UTF-8 to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?

I could, but recent posts make me think this is heading towards a religious debate :)

Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it.

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand)

Almost zero. A tiny hit when reading/writing surrogate pairs, to properly encode them as a single 4-byte UTF-8 sequence versus two 3-byte sequences.
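
To make the size difference concrete, here is a small standalone example (my own, not Lucene code) using U+1D11E, a character outside the BMP. Strict UTF-8 encodes it as one 4-byte sequence, while DataOutputStream.writeUTF's "modified UTF-8" encodes each half of its surrogate pair as a separate 3-byte sequence:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class SurrogateEncodingDemo {
    public static void main(String[] args) throws Exception {
        // U+1D11E (MUSICAL SYMBOL G CLEF), stored in Java as a surrogate pair.
        String s = "\uD834\uDD1E";

        // Strict UTF-8: one 4-byte sequence.
        byte[] utf8 = s.getBytes("UTF-8");
        System.out.println("strict UTF-8:   " + utf8.length + " bytes");        // 4

        // "Java modified UTF-8" (DataOutput.writeUTF): each surrogate becomes
        // its own 3-byte sequence, plus a 2-byte length prefix we subtract here.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        System.out.println("modified UTF-8: " + (bytes.size() - 2) + " bytes");  // 6
    }
}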

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format... I didn't see a version number, and it contains strings.

I don't know the gory details, but we've made compatibility-breaking changes in the past; the current version of Lucene can open older formats but only writes the most current format. I suspect it could be made backwards compatible. Worst case, we break compatibility in 2.0.

Ronald is correct that it would be easy to make the reader handle both "Java modified UTF-8" and UTF-8, and the writer always output UTF-8. So the only problem would be if older versions of Lucene (or maybe CLucene) wound up trying to read strings that contained 4-byte UTF-8 sequences, as they wouldn't know how to convert such a sequence into two UTF-16 Java chars.

Since 4-byte UTF-8 sequences are only for characters outside of the BMP, and these are rare, it seems like an OK thing to do, but that's just my uninformed view.
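
For illustration, here is a rough sketch of what a tolerant readChars-style loop could look like (hypothetical code, not the actual IndexInput implementation): it decodes 1-, 2-, and 3-byte sequences as before, so the old surrogate-pair encoding still works, and additionally expands a 4-byte sequence into the two UTF-16 Java chars of a surrogate pair:

static void readChars(java.io.DataInput in, char[] buffer, int start, int length)
        throws java.io.IOException {
    for (int i = start; i < start + length; i++) {
        int b = in.readByte() & 0xFF;
        if ((b & 0x80) == 0) {                        // 1 byte: U+0000..U+007F
            buffer[i] = (char) b;
        } else if ((b & 0xE0) == 0xC0) {              // 2 bytes
            buffer[i] = (char) (((b & 0x1F) << 6) | (in.readByte() & 0x3F));
        } else if ((b & 0xF0) == 0xE0) {              // 3 bytes; also covers the old
            buffer[i] = (char) (((b & 0x0F) << 12)    // two-3-byte surrogate encoding
                    | ((in.readByte() & 0x3F) << 6)
                    | (in.readByte() & 0x3F));
        } else {                                      // 4 bytes: outside the BMP,
            int cp = ((b & 0x07) << 18)               // becomes two Java chars
                    | ((in.readByte() & 0x3F) << 12)
                    | ((in.readByte() & 0x3F) << 6)
                    | (in.readByte() & 0x3F);
            buffer[i++] = (char) (0xD800 + ((cp - 0x10000) >> 10));   // high surrogate
            buffer[i]   = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF)); // low surrogate
            // (bounds checking for the second char is elided in this sketch)
        }
    }
}

An old reader hitting the 4-byte branch is exactly the failure case above: it has no rule for producing two chars from one sequence.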

d. The documentation could be clearer on what is meant by the "string length", but this is a trivial change.

That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still needs to be clarified. Currently it says:

"Lucene writes strings as a VInt representing the length, followed by the character data".

It should read:

"Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data."

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
