Ken Krugler
Mon, 29 Aug 2005 01:01:35 -0700
I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue.... anyone raising a hand?
I could, but recent posts make me think this is heading towards a religious debate :)
I think the following statements are all true:
a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version.
b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.
c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format... I didn't see a version number, and it contains strings.
d. The documentation could be clearer on what is meant by the "string length", but this is a trivial change.
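To illustrate why "string length" needs spelling out, here is a small Python sketch (Python only because it is one of the languages mentioned in this thread). It counts the same text three ways. The helper name string_lengths is made up for this example, and the sketch does not say which of the three counts Lucene actually writes; that is exactly what the documentation should state.

```python
def string_lengths(text: str) -> dict:
    """Return three plausible meanings of "length" for the same text.

    Hypothetical helper for illustration only; not part of Lucene.
    """
    return {
        # Unicode code points (what Python's len() counts).
        "code_points": len(text),
        # UTF-16 code units (what Java's String.length() counts).
        "utf16_units": len(text.encode("utf-16-be")) // 2,
        # Bytes in standard UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
    }
```

For plain ASCII all three counts agree, which is probably why the ambiguity goes unnoticed. They diverge as soon as the text contains accented characters or anything outside the BMP.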
What's unclear to me (not being a Perl, Python, etc. jock) is how much easier it would be to get these other implementations working with Lucene following a change to UTF-8. So I can't comment on whether the payoff would justify the time required to change things.
I'm also curious about the existing CLucene & PyLucene ports. Would they also need to be similarly modified, with the proposed changes?
One final point. I doubt people have been adding strings with embedded nulls, and text outside of the Unicode BMP is also very rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's only the above two edge cases that create an interoperability problem.
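A rough sketch of those two edge cases, in Python. It assumes Lucene's current string bytes match Java's "modified UTF-8" in these respects: U+0000 is written as the two bytes C0 80, and a character outside the BMP is written as two separately encoded surrogates. The function names are invented for this example; this is not Lucene code.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text per Java-style modified UTF-8 (illustrative sketch)."""
    # Work on UTF-16 code units, the way Java strings are stored.
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    out = bytearray()
    for u in units:
        if 0x0001 <= u <= 0x007F:
            out.append(u)                         # one byte, same as UTF-8
        elif u <= 0x07FF:
            # Two bytes. This branch also catches U+0000, giving C0 80.
            out.append(0xC0 | (u >> 6))
            out.append(0x80 | (u & 0x3F))
        else:
            # Three bytes. Each half of a surrogate pair lands here too.
            out.append(0xE0 | (u >> 12))
            out.append(0x80 | ((u >> 6) & 0x3F))
            out.append(0x80 | (u & 0x3F))
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Under these assumptions, ordinary BMP text without nulls produces exactly the same bytes as standard UTF-8, so a strict decoder in another language reads it fine. Only strings with an embedded null or a non-BMP character produce bytes that a strict decoder rejects.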
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200