Hello, Robert...

On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:

Sorry, but I think you are barking up the wrong tree... and your tone is quite bizarre. My personal OPINION is that your "script" language is an
abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and doesn't matter
much to the discussion - in a similar fashion your choice of words is
clearly not going to help matters.

My personal perspective is a utilitarian one: languages, platforms, they all come and go eventually, and in between a lot of stuff gets done. I enjoy and appreciate Java (what I know of it), and I watched the Ruby/Java spat a little while ago with dismay. The enmity is not returned. :)

It may be less efficient to decode in other languages, but I don't think the original Lucene designers were too worried about the efficiencies of other
languages/platforms.

That may be the case. I suppose we're about to find out how important the Lucene development community considers interchange. The phrase "standard UTF-8" in the documentation led me to believe that the intention was to deploy honest-to-goodness UTF-8. In fact, as was pointed out, the early versions of the Unicode standard were not very clear. Lucene was begun in 1998, and Unicode Corrigendum #1: "UTF-8 Shortest Form" wasn't released until 2001. My best guess is that it was supposed to be legal UTF-8 and that the non-conformance is unintentional.
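To make the gap between legal UTF-8 and a near-miss concrete, here is a minimal sketch in Java. It uses the JDK's DataOutputStream.writeUTF, which emits Java's documented "modified UTF-8", purely as a stand-in; that Lucene's own writer produces the same bytes is an assumption, not something checked against its source. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Comparison {
    /** Standard (shortest-form) UTF-8 bytes for the string. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java "modified UTF-8" bytes, with writeUTF's 2-byte length prefix stripped. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Space-separated uppercase hex, e.g. "C0 80". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\0";                                           // U+0000
        String clef = new String(Character.toChars(0x1D11E));        // U+1D11E, outside the BMP

        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));   // 00
        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));   // C0 80
        System.out.println("U+1D11E standard: " + hex(standardUtf8(clef)));  // F0 9D 84 9E
        System.out.println("U+1D11E modified: " + hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
    }
}
```

The two-byte form of U+0000 is not shortest-form, and the six-byte form of U+1D11E encodes each surrogate separately, so a strict UTF-8 decoder in another language is entitled to reject both.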

Otis Gospodnetic raised objections when the Plucene project made the decision to abandon index compatibility with Java Lucene. I've been arguing that that decision ought to be reconsidered. It will make it easier to achieve this shared goal of interoperability if Plucene does not have to go out of its way to defeat measures painstakingly put in place by the Perl5Porters team to ensure secure and robust Unicode support.

One of the reasons I have placed my own search engine project on hold was that I concluded I could not improve in a meaningful way on Lucene's file format. It's really a marvelous piece of work. Perhaps it will become the TIFF of inverted index formats. It seems to me that the Lucene project would benefit from having it widely adopted. I'd like to help with that.

Using String.getBytes("UTF-8") and new String(byte[], "UTF-8") is all
that is needed.

Thank you for the tip. At first blush, I'm concerned that those may be difficult to make work with OutputStream's readByte() without incurring a performance penalty, but if I'm wrong and it's six-of-one, half-dozen-of-another for Java Lucene, then if a change is going to be made, I'll argue for that one. That would harmonize with the way binary field data is stored, assuming that I can trust that portion of the spec document. ;)
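For reference, a minimal sketch of what that suggestion might look like: a byte count followed by standard UTF-8 bytes, which is the same shape as a length-prefixed binary field. It uses the JDK's DataOutputStream and DataInputStream with a fixed 4-byte length, not Lucene's own stream classes or its integer encoding; the class and method names are hypothetical, and StandardCharsets is used in place of the "UTF-8" string only to avoid a checked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedUtf8 {
    /** Encodes the string as a 4-byte big-endian byte count followed by standard UTF-8 bytes. */
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(utf8.length);   // length in bytes, not in Java chars
            out.write(utf8);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Decodes a record produced by writeString. */
    static String readString(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            int byteCount = in.readInt();
            byte[] utf8 = new byte[byteCount];
            in.readFully(utf8);          // one bulk read instead of a per-byte decode loop
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "na\u00EFve " + new String(Character.toChars(0x1D11E));
        byte[] record = writeString(original);
        System.out.println("record bytes: " + record.length);
        System.out.println("round trip ok: " + original.equals(readString(record)));
    }
}
```

Because the prefix counts bytes rather than characters, a reader in any language can skip or slurp the field without decoding it first. Whether this is faster or slower than a per-byte decode loop inside Java Lucene is something that would need to be measured.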

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

