Wolfgang Hoschek wrote:
I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions.

Good point. Lucene already has its own (buggy) UTF-8 implementation for performance reasons, so that wouldn't really be a big change.

The big question now seems to be whether the stored character sequence lengths should be counted in bytes or in characters. Byte counts would be fast and simple (whether we implement our own UTF-8 in Java or not), but they are not back-compatible. So do we bite the bullet and make a very incompatible change to the index format? Or do we make these counts be Unicode characters (which is mostly back-compatible) and make the code a bit more awkward? Some prototype implementations would be nice, just to see how awkward things get.
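To make the trade-off concrete, here is a minimal sketch of the two prefix choices. This is illustrative code, not Lucene's actual API: the class and method names are invented, the prefix is a plain int rather than a VInt, and supplementary (4-byte UTF-8) characters are omitted for brevity. The byte-counted reader can do one bulk read; the char-counted reader cannot know the byte length up front, so it must decode sequence by sequence until the character count is satisfied.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class PrefixDemo {

    // Byte-length prefix: trivial on both ends.
    static byte[] writeByteCounted(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(utf8.length);        // prefix = number of BYTES
        data.write(utf8);
        return out.toByteArray();
    }

    static String readByteCounted(byte[] buf) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);                // byte count known: one bulk read
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Char-count prefix: the writer is easy, the reader is the awkward part.
    static byte[] writeCharCounted(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(s.length());         // prefix = number of Java chars
        data.write(s.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    static String readCharCounted(byte[] buf) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf));
        int nChars = in.readInt();
        char[] chars = new char[nChars];
        for (int i = 0; i < nChars; i++) {
            int b = in.readUnsignedByte(); // decode one UTF-8 sequence at a time
            if (b < 0x80) {
                chars[i] = (char) b;                           // 1-byte sequence
            } else if (b < 0xE0) {
                chars[i] = (char) (((b & 0x1F) << 6)
                        | (in.readUnsignedByte() & 0x3F));     // 2-byte sequence
            } else {
                chars[i] = (char) (((b & 0x0F) << 12)
                        | ((in.readUnsignedByte() & 0x3F) << 6)
                        | (in.readUnsignedByte() & 0x3F));     // 3-byte sequence
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) throws IOException {
        String s = "caf\u00e9";            // 4 chars, 5 UTF-8 bytes
        System.out.println(readByteCounted(writeByteCounted(s)).equals(s));
        System.out.println(readCharCounted(writeCharCounted(s)).equals(s));
    }
}
```

Even in this toy form, the char-counted reader has to run a full UTF-8 decoder in its inner loop just to find the end of the string, whereas the byte-counted reader copies a known-length slice and decodes it afterwards (or could skip decoding entirely, e.g. when merging segments).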

Doug
