Wolfgang Hoschek wrote:
I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions.

Good point. Lucene already has its own (buggy) UTF-8 implementation for performance reasons, so that wouldn't really be a big change.

The big question now seems to be whether the stored character sequence lengths should be counted in bytes or in characters. Byte counts would be fast and simple (whether we implement our own UTF-8 in Java or not), but they are not back-compatible. So do we bite the bullet and make a very incompatible change to the index format? Or do we make these counts be Unicode characters (which is mostly back-compatible) and make the code a bit more awkward? Some prototype implementations would be nice, just to see how awkward things get.
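To make the trade-off concrete, here is a minimal sketch of the two prefix choices. This is illustrative code, not Lucene's actual API: the class and method names are invented, the prefix is a plain int rather than a VInt, and supplementary (4-byte UTF-8) characters are omitted for brevity. The byte-counted reader can do one bulk read; the char-counted reader cannot know the byte length up front, so it must decode sequence by sequence until the character count is satisfied.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class PrefixDemo {

    // Byte-length prefix: trivial on both ends.
    static byte[] writeByteCounted(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(utf8.length);        // prefix = number of BYTES
        data.write(utf8);
        return out.toByteArray();
    }

    static String readByteCounted(byte[] buf) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);                // byte count known: one bulk read
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Char-count prefix: the writer is easy, the reader is the awkward part.
    static byte[] writeCharCounted(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(s.length());         // prefix = number of Java chars
        data.write(s.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    static String readCharCounted(byte[] buf) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
        int nChars = in.readInt();
        char[] chars = new char[nChars];
        for (int i = 0; i < nChars; i++) {
            int b = in.readUnsignedByte(); // decode one UTF-8 sequence at a time
            if (b < 0x80) {
                chars[i] = (char) b;                           // 1-byte sequence
            } else if (b < 0xE0) {
                chars[i] = (char) (((b & 0x1F) << 6)
                        | (in.readUnsignedByte() & 0x3F));     // 2-byte sequence
            } else {
                chars[i] = (char) (((b & 0x0F) << 12)
                        | ((in.readUnsignedByte() & 0x3F) << 6)
                        | (in.readUnsignedByte() & 0x3F));     // 3-byte sequence
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) throws IOException {
        String s = "caf\u00e9";            // 4 chars, 5 UTF-8 bytes
        System.out.println(readByteCounted(writeByteCounted(s)).equals(s));
        System.out.println(readCharCounted(writeCharCounted(s)).equals(s));
    }
}
```

Even in this toy form, the char-counted reader has to run a full UTF-8 decoder in its inner loop just to find the end of the string, whereas the byte-counted reader copies a known-length slice and decodes it afterwards (or could skip decoding entirely, e.g. when merging segments).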

Doug
