Re: Lucene does NOT use UTF-8.

Marvin Humphrey Sun, 28 Aug 2005 22:40:04 -0700

Hello, Robert...

On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:

Sorry, but I think you are barking up the wrong tree... and your tone is
quite bizarre. My personal OPINION is that your "script" language is an
abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and doesn't matter
much to the discussion - in a similar fashion your choice of words is
clearly not going to help matters.

My personal perspective is a utilitarian one: languages, platforms, they all come and go eventually, and in between a lot of stuff gets done. I enjoy and appreciate Java (what I know of it), and I watched the Ruby/Java spat a little while ago with dismay. The enmity is not returned. :)

It may be less efficient to decode in other languages, but I don't think the
original Lucene designers were too worried about the efficiencies of other
languages/platforms.

That may be the case. I suppose we're about to find out how important the Lucene development community considers interchange. The phrase "standard UTF-8" in the documentation led me to believe that the intention was to deploy honest-to-goodness UTF-8. In fact, as was pointed out, the early versions of the Unicode standard were not very clear. Lucene was originally begun in 1998, and Unicode Corrigendum #1: "UTF-8 Shortest Form" wasn't released until 2001. My best guess is that it was supposed to be legal UTF-8 and that the non-conformance is unintentional.
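To make the kind of non-conformance at issue concrete, here is a minimal sketch comparing standard UTF-8 with Java's "modified UTF-8". It assumes that java.io.DataOutputStream.writeUTF is a fair stand-in for the behavior under discussion; it is not Lucene's own writer code. The class name Utf8Compare and its helper methods are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of Lucene.
public class Utf8Compare {

    // Encode with the JDK's standard UTF-8 encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode with Java's modified UTF-8 (as written by writeUTF),
    // dropping the 2-byte length prefix that writeUTF emits first.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";              // U+0000
        String clef = "\uD834\uDD1E";   // U+1D11E, outside the BMP

        // U+0000: one byte in standard UTF-8, an overlong two-byte
        // sequence in modified UTF-8.
        System.out.println(hex(standardUtf8(nul)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));   // C0 80

        // U+1D11E: four bytes in standard UTF-8, two three-byte
        // encoded surrogates in modified UTF-8.
        System.out.println(hex(standardUtf8(clef)));  // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
    }
}
```

A strict UTF-8 decoder rejects both modified forms: the two-byte NUL is not in shortest form, and encoded surrogates are not legal UTF-8.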

Otis Gospodnetic raised objections when the Plucene project made the decision to abandon index compatibility with Java Lucene. I've been arguing that that decision ought to be reconsidered. It will make it easier to achieve this shared goal of interoperability if Plucene does not have to go out of its way to defeat measures painstakingly put in place by the Perl5Porters team to ensure secure and robust Unicode support.

One of the reasons I have placed my own search engine project on hold was that I concluded I could not improve in a meaningful way on Lucene's file format. It's really a marvelous piece of work. Perhaps it will become the TIFF of inverted index formats. It seems to me that the Lucene project would benefit from having it widely adopted. I'd like to help with that.

Using String.getBytes("UTF-8"), and String.String(byte[],"UTF-8") is all
that is needed.

Thank you for the tip. At first blush, I'm concerned that those may be difficult to make work with OutputStream's readByte() without incurring a performance penalty, but if I'm wrong and it's six-of-one-half-dozen-of-another for Java Lucene, then if a change is going to be made, I'll argue for that one. That would harmonize with the way binary field data is stored, assuming that I can trust that portion of the spec document. ;)
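As a rough sketch of what the suggested approach could look like, the following writes a string as a byte count followed by standard UTF-8 bytes, then reads it back. The 4-byte length prefix, the class name Utf8StringIO, and its methods are assumptions made for illustration; none of this is Lucene's actual file format or API, and it uses the JDK's DataOutputStream and DataInputStream rather than Lucene's own stream classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Lucene.
public class Utf8StringIO {

    // Write the UTF-8 byte count (assumed 4-byte prefix), then the bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the byte count, then decode exactly that many bytes as UTF-8.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience wrapper: serialize to memory and read the string back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                writeString(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                return readString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "na\u00EFve \uD834\uDD1E";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Because the prefix counts bytes rather than characters, a reader in any language can skip or slice the string without decoding it, which is the same property a length-prefixed binary field has. Whether the bulk encode and decode costs more than per-byte writes would need measuring.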


Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

