On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:

Yonik Seeley wrote:

I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count.


TermBuffer.java:66

Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would arise if the prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance-critical code.
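To make the byte-count variant concrete, here is a minimal sketch of the idea described above. It is not the code in TermBuffer.java; the class name and the `next` method are hypothetical, and it assumes standard UTF-8 bytes. It keeps the first `prefixBytes` bytes of the previous term, appends the suffix bytes, and then decodes the whole sequence, which is exactly where the shared prefix gets re-converted for every term.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's TermBuffer.
final class BytePrefixTermSketch {
    private byte[] bytes = new byte[16]; // reused across terms
    private int length = 0;              // valid bytes of the current term

    /** Keeps the first prefixBytes bytes of the previous term, then appends suffix. */
    String next(int prefixBytes, byte[] suffix) {
        if (prefixBytes < 0 || prefixBytes > length) {
            throw new IllegalArgumentException("bad prefix length: " + prefixBytes);
        }
        int newLength = prefixBytes + suffix.length;
        if (newLength > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(newLength, bytes.length * 2));
        }
        System.arraycopy(suffix, 0, bytes, prefixBytes, suffix.length);
        length = newLength;
        // The full sequence is decoded here, shared prefix included.
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}
```

Comparing terms on the raw bytes instead of on the decoded String would be one way to avoid the repeated decode, but whether that matters is a question for measurement.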

We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know; only experiments will tell.
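One candidate for such an experiment might look like the sketch below: a single CharsetEncoder and a single ByteBuffer reused for every term, so that per-call setup is kept as small as the java.nio API allows. The class and method names are hypothetical, and this is only one of several ways to drive the encoder; it would still need to be timed against the alternatives with a proper harness and warm-up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical benchmark candidate: java.nio encoding with reused objects.
final class EncoderReuseSketch {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private ByteBuffer out = ByteBuffer.allocate(64);

    /** Encodes s into the reused buffer and returns the number of bytes written. */
    int encode(String s) {
        // Worst-case output size, so the encoder cannot overflow the buffer.
        int max = (int) (s.length() * encoder.maxBytesPerChar());
        if (out.capacity() < max) {
            out = ByteBuffer.allocate(max);
        }
        out.clear();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isError()) {
            // For UTF-8 this means an unpaired surrogate in the input.
            throw new IllegalArgumentException("cannot encode: " + result);
        }
        encoder.flush(out);
        return out.position();
    }
}
```

Note that `CharBuffer.wrap` still allocates a small wrapper object per call, which is part of what a benchmark would be measuring.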

Doug


I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to the startup/init time of these methods when individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for Lucene just as well. Bottom line: It's probably an issue that can be dealt with via a proper implementation; it probably shouldn't dictate design directions.
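As a rough idea of what such a custom implementation could look like, here is a minimal sketch of a hand-written encoder that writes standard UTF-8 straight into a caller-supplied byte array, with no per-call object setup. The names are hypothetical, it is not the binary XML code mentioned above, and the choice to replace unpaired surrogates with '?' is an assumption made for the example.

```java
// Hypothetical hand-rolled UTF-8 encoder; a sketch, not a tuned implementation.
final class Utf8Sketch {
    /**
     * Encodes s as UTF-8 into dst starting at index 0 and returns the byte count.
     * dst must have room for at least 3 * s.length() bytes.
     */
    static int encode(CharSequence s, byte[] dst) {
        int n = 0;
        int len = s.length();
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {                       // 1 byte: ASCII
                dst[n++] = (byte) c;
            } else if (c < 0x800) {               // 2 bytes
                dst[n++] = (byte) (0xC0 | (c >> 6));
                dst[n++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isSurrogate(c)) {
                if (Character.isHighSurrogate(c) && i + 1 < len
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    // 4 bytes: supplementary code point from a surrogate pair
                    int cp = Character.toCodePoint(c, s.charAt(++i));
                    dst[n++] = (byte) (0xF0 | (cp >> 18));
                    dst[n++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                    dst[n++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                    dst[n++] = (byte) (0x80 | (cp & 0x3F));
                } else {
                    dst[n++] = (byte) '?';        // assumption: replace unpaired surrogate
                }
            } else {                              // 3 bytes: rest of the BMP
                dst[n++] = (byte) (0xE0 | (c >> 12));
                dst[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                dst[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return n;
    }
}
```

A caller would keep one `byte[]` around, grow it when `3 * s.length()` exceeds its size, and pass it in for every term.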

Wolfgang.
