On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Are the string diffs and comparisons now performed against raw
> > bytes, so that fewer conversions are needed?
>
> Alas, not yet: Lucene still uses UTF16 java chars internally. The
> conversion to UTF-8 happens "at the last minute" when writing, and
> "immediately" when reading.
>
> I started exploring keeping UTF-8 bytes further in, but it quickly
> got messy because it would require changing how the term infos are
> sorted to be unicode code point order. Comparing bytes in UTF-8 is
> the same as comparing unicode code points, which is nice. But
> comparing UTF-16 values is almost but not quite the same. So
> suddenly everywhere where a string comparison takes place I had to
> assess whether that comparison should be by unicode code point, and
> call our own method for doing so. It quickly became a "big" project
> so I ran back to sorting by UTF-16 value.
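
To make the "almost but not quite the same" point concrete, here is a minimal sketch of where the two orders diverge. It is only an illustration, not Lucene code: the class and method names (CodePointOrder, compareByCodePoint, compareUtf8Bytes) are hypothetical. The two orders disagree only when a supplementary character (stored as a surrogate pair, 0xD800-0xDFFF) is compared against a BMP character in the range U+E000-U+FFFF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Lucene.
class CodePointOrder {

    // Compare two strings by Unicode code point instead of by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        // While all code points so far are equal, both strings have consumed
        // the same number of chars, so a single index is enough.
        while (i < a.length() && i < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    // Compare the UTF-8 encodings byte by byte, treating bytes as unsigned.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                  // U+FF5E, one UTF-16 char
        String supp = new String(Character.toChars(0x10400));   // U+10400, surrogate pair D801 DC00

        // String.compareTo compares UTF-16 code units: 0xFF5E > 0xD801.
        System.out.println("UTF-16:     " + Integer.signum(bmp.compareTo(supp)));
        // Code point order: 0xFF5E < 0x10400.
        System.out.println("code point: " + Integer.signum(compareByCodePoint(bmp, supp)));
        // UTF-8 byte order agrees with code point order: EF.. < F0..
        System.out.println("UTF-8:      " + Integer.signum(compareUtf8Bytes(bmp, supp)));
    }
}
```

For this pair, String.compareTo puts U+FF5E after U+10400, while the code point and UTF-8 byte comparisons both put it before. For strings that contain no supplementary characters, all three comparisons agree.
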

Hmmm, can't we always do it by unicode code point? When do we need
UTF-16 order?

-Yonik