Re: Lucene does NOT use UTF-8

Erik Hatcher Mon, 29 Aug 2005 01:30:45 -0700

On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:

I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue.... anyone raising a hand?
I could, but recent posts make me think this is heading towards a religious debate :)

Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it.

I think the following statements are all true:
a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version.
b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand)

c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format... I didn't see a version number, and it contains strings.

I don't know the gory details, but we've made compatibility-breaking changes in the past, and the current version of Lucene can open older formats but only write the most current format. I suspect it could be made to be backwards compatible. Worst case, we break compatibility in 2.0.

d. The documentation could be clearer on what is meant by the "string length", but this is a trivial change.


That change was made by Daniel soon after this discussion began.
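
For anyone who, like me, has not followed the encoding details: points (b) and (d) can be illustrated with a minimal sketch. It assumes the JDK's own "modified UTF-8" (what DataOutputStream.writeUTF produces) as a stand-in for the non-conformant encoding under discussion; it is not Lucene's I/O code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not part of Lucene).
class Utf8Comparison {

    // Conformant UTF-8 bytes for the string.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // The JDK's "modified UTF-8" bytes, as written by DataOutputStream.writeUTF,
    // with the leading 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated upper-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF: one code point outside the BMP,
        // stored in a Java String as a surrogate pair.
        String clef = "\uD834\uDD1E";

        System.out.println("Java chars (UTF-16 units): " + clef.length());
        System.out.println("Unicode code points:       "
                + clef.codePointCount(0, clef.length()));
        System.out.println("standard UTF-8 bytes:      " + hex(standardUtf8(clef)));
        System.out.println("modified UTF-8 bytes:      " + hex(modifiedUtf8(clef)));
    }
}
```

For this one character the three candidate "string lengths" are all different (2 Java chars, 1 code point, 4 standard UTF-8 bytes), and the two encodings produce different byte sequences. Whichever count a file format stores, a reader written in another language has to know which one it is.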

What's unclear to me (not being a Perl, Python, etc. jock) is how much easier it would be to get these other implementations working with Lucene, following a change to UTF-8. So I can't comment on the return on time required to change things.
I'm also curious about the existing CLucene & PyLucene ports. Would they also need to be similarly modified, with the proposed changes?

PyLucene is literally the Java version of Lucene underneath (via GCJ/SWIG), so no worries there. CLucene would need to be changed, as well as DotLucene and the other ports out there.

If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene?


    Erik

