Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless Wed, 26 Mar 2008 15:07:27 -0700

Yonik Seeley <[EMAIL PROTECTED]> wrote:

>  Hmmm, can't we always do it by unicode code point?
>  When do we need UTF-16 order?


In theory, we can.  I think the sort order doesn't matter much, as
long as everyone (writers & readers) agrees on what it is.  I think
unicode code point order is more "standards compliant" too.

A big benefit is then we could leave things (eg TermBuffer and maybe
eventually Term, FieldCache) as UTF8 bytes and save on the conversion
cost when reading.
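
To make that concrete, here is a rough sketch (not Lucene code; the
class and method names are made up for illustration).  It relies on a
property of well-formed UTF-8: comparing the encoded bytes as unsigned
values gives the same order as comparing by code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
final class Utf8Order {
    private Utf8Order() {}

    /** Encodes a String as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Compares two UTF-8 byte arrays as unsigned bytes. For well-formed
     * UTF-8 this is the same as Unicode code point order.
     */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Java bytes are signed, so mask to get the unsigned value.
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // One array is a prefix of the other: the shorter sorts first.
        return a.length - b.length;
    }
}
```

This only helps for data kept as UTF-8 bytes; Strings in memory are
UTF-16 and would still need a code point order comparison.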

But I don't think Java provides a way to do this comparison?  However
it's not hard to implement your own:

  http://www.icu-project.org/docs/papers/utf16_code_point_order.html
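
A minimal sketch of that approach, assuming well-formed UTF-16 input
(the class and method names here are hypothetical, not anything in
Lucene or the JDK).  At the first differing code unit, surrogates are
shifted above the rest of the BMP before comparing, so supplementary
characters sort after U+E000..U+FFFF:

```java
// Hypothetical helper for illustration only; not part of Lucene or the JDK.
final class CodePointOrder {
    private CodePointOrder() {}

    /** Compares two UTF-16 strings in Unicode code point order. */
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(i);
            if (c1 != c2) {
                // Only the first differing code unit decides the order.
                return fixUp(c1) - fixUp(c2);
            }
        }
        // One string is a prefix of the other: the shorter sorts first.
        return a.length() - b.length();
    }

    /**
     * Remaps a code unit so that surrogates (0xD800..0xDFFF) sort above
     * 0xE000..0xFFFF. Units below 0xD800 are left unchanged.
     */
    private static int fixUp(char c) {
        if (c < 0xD800) {
            return c;
        }
        return c >= 0xE000 ? c - 0x800 : c + 0x2000;
    }
}
```

For example, U+FF5E vs U+10000 (encoded as the surrogate pair
D800 DC00): String.compareTo puts U+10000 first because 0xD800 <
0xFF5E, while the method above puts U+FF5E first.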

But then I worried about how much slower that code is than
String.compareTo, and I found a lot of places where an innocent
compareTo or < or > needed to be changed to this method call.  Field
name comparisons would have to be fixed too.  Then for backwards
compatibility all of these places that do comparisons would have to
fall back to the Java way (UTF-16 code unit order) when interacting
with an older segment.

I think we can still explore this?  It just seemed way too big to
glom into the already-big changes in LUCENE-510.

Mike
