On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:
Yonik Seeley wrote:
I've been looking around... do you have a pointer to the source
where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the
problem that would be posed by the prefix length being a byte count.
TermBuffer.java:66
Things could work fine if the prefix length were a byte count. A
byte buffer could easily be constructed that contains the full byte
sequence (prefix + suffix), and then this could be converted to a
String. The inefficiency would arise if the prefix were re-converted
from UTF-8 for each term, e.g., in order to compare it to the target.
Prefixes are frequently longer than suffixes, so this could be
significant. Does that make sense? I don't know whether it would
actually be significant, although TermBuffer.java was added
recently as a measurable performance enhancement, so this is
performance-critical code.
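
To make the byte-count variant concrete, here is a minimal sketch of
rebuilding a term from a shared byte buffer. It is not the code in
TermBuffer.java; the class BytePrefixTermReader and its method next are
hypothetical names, and standard UTF-8 is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not Lucene's TermBuffer): prefix length is a byte count. */
final class BytePrefixTermReader {
    private byte[] bytes = new byte[16]; // UTF-8 bytes of the previous term
    private int length = 0;              // number of valid bytes in the buffer

    /** Keeps the first prefixByteLength bytes of the previous term, then appends suffix. */
    String next(int prefixByteLength, byte[] suffix) {
        if (prefixByteLength < 0 || prefixByteLength > length) {
            throw new IllegalArgumentException("prefix longer than previous term");
        }
        int newLength = prefixByteLength + suffix.length;
        if (newLength > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(newLength, bytes.length * 2));
        }
        System.arraycopy(suffix, 0, bytes, prefixByteLength, suffix.length);
        length = newLength;
        // Prefix and suffix are decoded together on every call; this repeated
        // prefix decoding is the potential inefficiency described above.
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}
```

Calling next(0, ...) with the bytes of one term and then next(n, ...)
with a short suffix reuses the first n bytes of the previous term
without copying them again, but still decodes them again.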
We need to stop discussing this in the abstract and start coding
alternatives and benchmarking them. Is
java.nio.charset.CharsetEncoder fast enough? Will moving things
through CharBuffer and ByteBuffer be too slow? Should Lucene keep
maintaining its own UTF-8 implementation for performance? I don't
know; only experiments will tell.
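
One possible shape for such an experiment is sketched below. It is an
assumption-laden illustration, not Lucene code: the class
Utf8EncodeExperiment and its methods are hypothetical, the hand-written
encoder emits standard UTF-8 rather than whatever the index format
uses, and the timing loop is naive (no warm-up control), so the numbers
it prints are only a rough first signal.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical comparison harness; all names here are illustrative. */
final class Utf8EncodeExperiment {

    /**
     * Hand-written UTF-8 encoder. Returns the number of bytes written.
     * The caller must size out to at least 3 * s.length() bytes.
     * Unpaired surrogates are not validated; they are written as three bytes.
     */
    static int encodeByHand(CharSequence s, byte[] out) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            int c = ch;
            if (Character.isHighSurrogate(ch) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                c = Character.toCodePoint(ch, s.charAt(++i)); // combine the pair
            }
            if (c < 0x80) {
                out[n++] = (byte) c;
            } else if (c < 0x800) {
                out[n++] = (byte) (0xC0 | (c >> 6));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out[n++] = (byte) (0xE0 | (c >> 12));
                out[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            } else {
                out[n++] = (byte) (0xF0 | (c >> 18));
                out[n++] = (byte) (0x80 | ((c >> 12) & 0x3F));
                out[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return n;
    }

    /** Encodes through java.nio with a fresh encoder and fresh buffers per call. */
    static int encodeWithNio(CharSequence s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(s));
            return bytes.remaining();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not encodable", e);
        }
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "caf\u00e9", "\u65e5\u672c\u8a9e", "index"};
        byte[] scratch = new byte[64];
        int rounds = 200_000;
        long sink = 0; // accumulated and printed so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String t : terms) sink += encodeByHand(t, scratch);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String t : terms) sink += encodeWithNio(t);
        }
        long t2 = System.nanoTime();

        System.out.println("by hand:  " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("java.nio: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sink);
    }
}
```

A proper benchmark would use realistic term data and a harness that
handles JIT warm-up; this only shows the two code paths side by side.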
Doug
I don't know if it matters for Lucene usage. But if using
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a
significant problem, it's probably due to startup/init time of these
methods for individually converting many small strings, not
inherently due to UTF-8 usage. I'm confident that a custom UTF-8
implementation can almost completely eliminate these issues. I've
done this before for binary XML with great success, and it could
certainly be done for Lucene just as well. Bottom line: it's probably
an issue that can be dealt with through a proper implementation; it
probably shouldn't dictate design directions.
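
As a rough illustration of removing per-string setup cost, the sketch
below keeps one CharsetEncoder and one output ByteBuffer alive across
calls. This is a different technique from a fully custom UTF-8
implementation, and it is not the binary XML code mentioned above; the
class ReusableUtf8Encoder is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: one encoder and one output buffer shared by many small strings. */
final class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private ByteBuffer out = ByteBuffer.allocate(64);

    /** Encodes s into the shared buffer; the result is only valid until the next call. */
    ByteBuffer encode(CharSequence s) {
        int needed = (int) Math.ceil(s.length() * (double) encoder.maxBytesPerChar());
        if (out.capacity() < needed) {
            out = ByteBuffer.allocate(needed); // grow only when a longer string arrives
        }
        out.clear();
        encoder.reset();
        // CharBuffer.wrap still creates a small wrapper object per call.
        CharBuffer in = CharBuffer.wrap(s);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("cannot encode input: " + result);
        }
        out.flip(); // make the encoded bytes readable from position 0
        return out;
    }
}
```

Whether this is enough, or whether a hand-written encoder is still
needed, would have to be measured on real term data.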
Wolfgang.