Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Yonik Seeley Wed, 26 Mar 2008 14:31:47 -0700

On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>  > Are the string diffs and comparisons now performed against raw
>  > bytes, so that fewer conversions are needed?
>
>  Alas, not yet: Lucene still uses UTF-16 Java chars internally.  The
>  conversion to UTF-8 happens "at the last minute" when writing, and
>  "immediately" when reading.
>
>  I started exploring keeping UTF-8 bytes further in, but it quickly
>  got messy because it would require changing how the term infos are
>  sorted to be Unicode code point order.  Comparing bytes in UTF-8 is
>  the same as comparing Unicode code points, which is nice.  But
>  comparing UTF-16 values is almost but not quite the same.  So
>  suddenly, everywhere a string comparison takes place, I had to
>  assess whether that comparison should be by Unicode code point, and
>  call our own method for doing so.  It quickly became a "big" project
>  so I ran back to sorting by UTF-16 value.
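
For reference, the difference between the two orders described above
only shows up when a supplementary character (stored as a surrogate
pair, 0xD800-0xDFFF) is compared with a BMP character at U+E000 or
above.  The following is a minimal Java sketch of that case, not code
from Lucene: the class name CodePointOrderDemo and both helper methods
are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
final class CodePointOrderDemo {

    // Compare two strings by Unicode code point rather than by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            // Equal code points occupy the same number of chars in both strings.
            i += Character.charCount(ca);
        }
        return a.length() - b.length();
    }

    // Compare the UTF-8 encodings of two strings as unsigned bytes.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int d = (x[k] & 0xFF) - (y[k] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";           // U+FF5E, a single UTF-16 code unit
        String supp = "\uD801\uDC00";    // U+10400, a surrogate pair

        // String.compareTo compares UTF-16 code units: 0xFF5E > 0xD801,
        // so the BMP character sorts AFTER the supplementary one (positive).
        System.out.println("UTF-16 order:     " + bmp.compareTo(supp));

        // By code point, U+FF5E < U+10400, so the result is negative.
        System.out.println("code point order: " + compareByCodePoint(bmp, supp));

        // UTF-8 bytes EF BD 9E vs F0 90 90 80: also negative, matching
        // code point order.
        System.out.println("UTF-8 byte order: " + compareUtf8Bytes(bmp, supp));
    }
}
```

For strings with no supplementary characters, all three comparisons
agree in sign, which is why the mismatch is easy to overlook.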


Hmmm, can't we always sort by Unicode code point?
When do we need UTF-16 order?

-Yonik
