Hello, Robert...
On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:
Sorry, but I think you are barking up the wrong tree... and your tone is
quite bizarre. My personal OPINION is that your "script" language is an
abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and doesn't matter
much to the discussion - in a similar fashion your choice of words is
clearly not going to help matters.
My personal perspective is a utilitarian one: languages, platforms,
they all come and go eventually, and in between a lot of stuff gets
done. I enjoy and appreciate Java (what I know of it), and I watched
the Ruby/Java spat a little while ago with dismay. The enmity is not
returned. :)
It may be less efficient to decode in other languages, but I don't think the
original Lucene designers were too worried about the efficiencies of other
languages/platforms.
That may be the case. I suppose we're about to find out how
important the Lucene development community considers interchange.
The phrase "standard UTF-8" in the documentation led me to believe
that the intention was to deploy honest-to-goodness UTF-8. In fact,
as was pointed out, the early versions of the Unicode standard were
not very clear. Lucene was originally begun in 1998, and Unicode
Corrigendum #1: "UTF-8 Shortest Form" wasn't released until 2001. My
best guess is that it was supposed to be legal UTF-8 and that the
non-conformance is unintentional.
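
To make the distinction concrete, here is a minimal sketch comparing standard UTF-8 with Java-style "modified UTF-8". It assumes the non-conformance under discussion is of the kind found in the JDK's DataOutputStream.writeUTF (U+0000 written as two bytes, and characters outside the Basic Multilingual Plane written as two 3-byte surrogates). writeUTF is used here only as a stand-in; this is not Lucene's own writer, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only.
final class Utf8Comparison {

    // Standard UTF-8, as produced by the JDK's charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, as produced by DataOutputStream.writeUTF.
    // writeUTF prepends a 2-byte length, which is stripped here.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);
        String clef = new String(Character.toChars(0x1D11E)); // outside the BMP

        System.out.println(hex(standardUtf8(nul)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));   // c0 80
        System.out.println(hex(standardUtf8(clef)));  // f0 9d 84 9e
        System.out.println(hex(modifiedUtf8(clef)));  // ed a0 b4 ed b4 9e
    }
}
```

A strict UTF-8 decoder is entitled to reject both modified forms: "c0 80" is an overlong encoding of U+0000, and "ed a0 b4 ed b4 9e" encodes two surrogate code points rather than one character.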
Otis Gospodnetic raised objections when the Plucene project made the
decision to abandon index compatibility with Java Lucene. I've been
arguing that that decision ought to be reconsidered. It will make it
easier to achieve this shared goal of interoperability if Plucene
does not have to go out of its way to defeat measures painstakingly
put in place by the Perl5Porters team to ensure secure and robust
Unicode support.
One of the reasons I have placed my own search engine project on hold
was that I concluded I could not improve in a meaningful way on
Lucene's file format. It's really a marvelous piece of work.
Perhaps it will become the TIFF of inverted index formats. It seems
to me that the Lucene project would benefit from having it widely
adopted. I'd like to help with that.
Using String.getBytes("UTF-8") and new String(byte[], "UTF-8") is all
that is needed.
Thank you for the tip. At first blush, I'm concerned that those may
be difficult to make work with InputStream's readByte() without
incurring a performance penalty, but if I'm wrong and it's six-of-one,
half-dozen-of-another for Java Lucene, then if a change is going to
be made, I'll argue for that one. That would harmonize with the way
binary field data is stored, assuming that I can trust that portion
of the spec document. ;)
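
A minimal sketch of what the suggested approach could look like: the string is encoded to standard UTF-8 in one call, written with a byte-count prefix, and read back in bulk rather than byte by byte. The names writeString, readString, and roundTrip are hypothetical, the 4-byte length prefix is an assumption made for simplicity, and java.io's DataOutputStream and DataInputStream stand in for Lucene's own stream classes. It says nothing about relative performance inside Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class Utf8RoundTrip {

    // Write the UTF-8 byte count (assumed 4-byte prefix), then the bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the byte count, then decode that many bytes as standard UTF-8.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode to an in-memory buffer and decode it again.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeString(out, s);
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            return readString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 " + new String(Character.toChars(0x1D11E));
        System.out.println(sample.equals(roundTrip(sample))); // true
    }
}
```

Because the prefix counts bytes rather than characters, a reader in any language can skip or slurp the string without decoding it first.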
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/