> I'm not familiar enough with UTF-8 to follow the details of this
> discussion.  I hope other Lucene developers are, so we can resolve this
> issue.... anyone raising a hand?

I could, but recent posts make me think this is heading towards a religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings (see the sketch after this list).

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format: I didn't see a version number, and it contains strings.

d. The documentation could be clearer on what is meant by the "string length", but this is a trivial change.
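
Regarding (b), here's a minimal sketch of what a conformant writer could
look like, using only JDK calls. The method name and the byte-count length
prefix are my own assumptions for illustration, not Lucene's actual
on-disk format:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;

  public class ConformantStrings {
    // Hypothetical writer: a VInt byte-count prefix followed by standard
    // (conformant) UTF-8. String.getBytes("UTF-8") never emits the modified
    // UTF-8 quirks (0xC0 0x80 for nulls, surrogates encoded separately).
    static void writeConformantString(ByteArrayOutputStream out, String s)
        throws IOException {
      byte[] utf8 = s.getBytes("UTF-8");
      writeVInt(out, utf8.length); // length in *bytes*, which also settles (d)
      out.write(utf8, 0, utf8.length);
    }

    // Lucene-style VInt: seven bits per byte, high bit set means "more".
    static void writeVInt(ByteArrayOutputStream out, int i) {
      while ((i & ~0x7F) != 0) {
        out.write((i & 0x7F) | 0x80);
        i >>>= 7;
      }
      out.write(i);
    }
  }

The reader would be symmetric. The backwards-compatibility problem in (c)
is that an old index carries no flag telling a reader which encoding to
expect.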

What's unclear to me (not being a Perl, Python, etc. jock) is how much easier it would be to get these other implementations working with Lucene, following a change to UTF-8. So I can't comment on the return on the time required to change things.

I'm also curious about the existing CLucene & PyLucene ports. Would they also need to be modified to track the proposed changes?

One final point. I doubt people have been adding strings with embedded nulls, and text outside of the Unicode BMP is also very rare. So _most_ Lucene indexes already contain only valid UTF-8 data. It's just those two edge cases (nulls, which Java's modified UTF-8 encodes as the overlong pair 0xC0 0x80, and non-BMP characters, which it encodes as two separate 3-byte surrogate sequences) that create an interoperability problem.
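
To make those two edge cases concrete, here's a small self-contained
demonstration using the JDK's DataOutputStream.writeUTF, which (as far as
I can tell) produces the same modified UTF-8 byte sequences for these
characters that Lucene's string writing does:

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;

  public class Utf8EdgeCases {
    public static void main(String[] args) throws Exception {
      // Edge case 1: embedded null. Standard UTF-8 encodes U+0000 as the
      // single byte 00; modified UTF-8 uses the overlong pair C0 80.
      dump("a\u0000b");     // standard: 61 00 62
                            // modified: 00 04 61 C0 80 62 (2-byte length prefix)

      // Edge case 2: outside the BMP. Standard UTF-8 encodes U+1D11E
      // (musical G clef) as the 4 bytes F0 9D 84 9E; modified UTF-8
      // encodes each UTF-16 surrogate separately, giving 6 bytes.
      dump("\uD834\uDD1E"); // standard: F0 9D 84 9E
                            // modified: 00 06 ED A0 B4 ED B4 9E
    }

    static void dump(String s) throws Exception {
      byte[] standard = s.getBytes("UTF-8");
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      new DataOutputStream(baos).writeUTF(s); // JDK modified UTF-8
      System.out.println("standard: " + hex(standard));
      System.out.println("modified: " + hex(baos.toByteArray()));
    }

    static String hex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < bytes.length; i++)
        sb.append(String.format("%02X ", bytes[i]));
      return sb.toString();
    }
  }

A reader that expects strict UTF-8 (Perl, Python, etc.) will reject or
mangle the "modified" sequences above, which is exactly the
interoperability problem.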

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
