Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless Wed, 26 Mar 2008 15:35:39 -0700


Yonik Seeley wrote:

On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

Yonik Seeley <[EMAIL PROTECTED]> wrote:

 Hmmm, can't we always do it by unicode code point?
 When do we need UTF-16 order?


 In theory, we can.  I think the sort order doesn't matter much, as
 long as everyone (writers & readers) agree what it is.  I think
 unicode code point order is more "standards compliant" too.

 A big benefit is then we could leave things (eg TermBuffer and maybe
 eventually Term, FieldCache) as UTF8 bytes and save on the conversion
 cost when reading.

 But I don't think Java provides a way to do this comparison?  However
 it's not hard to implement your own:

  http://www.icu-project.org/docs/papers/utf16_code_point_order.html


Not sure I follow... you just do a byte-by-byte comparison right?  For
ASCII, this should be slightly faster (same number of comparisons,
less memory space and hence less cache space overall).

Sorry, you're right: if you're working with byte[] at the time, a
byte by byte comparison of UTF8 gives you the same order as unicode
code point.

It's when you need to compare a String or char[] to one another, or
to a UTF8 byte[], that you need that code.
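
To make the distinction concrete, here is a minimal Java sketch (not code
from LUCENE-510; the class and method names are hypothetical). compareUtf8
is the plain unsigned byte-by-byte comparison, and compareCodePoints
applies the surrogate fix-up described in the ICU paper linked above, so
that two Strings compare in code point order rather than UTF-16 order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class CodePointOrder {

    // Unsigned byte-by-byte comparison of two UTF-8 encoded values.
    // For well-formed UTF-8 this yields Unicode code point order.
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Compares two Strings in code point order. Only the first differing
    // UTF-16 code unit matters; it is remapped so that surrogates sort
    // above every other BMP code unit.
    static int compareCodePoints(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(i);
            if (c1 != c2) {
                return fixUp(c1) - fixUp(c2);
            }
        }
        return a.length() - b.length();
    }

    // Remap: 0xE000..0xFFFF moves down to 0xD800..0xF7FF,
    // surrogates 0xD800..0xDFFF move up to 0xF800..0xFFFF.
    private static int fixUp(char c) {
        if (c >= 0xE000) {
            return c - 0x800;
        }
        if (c >= 0xD800) {
            return c + 0x2000;
        }
        return c;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";          // U+FF5E, one UTF-16 code unit
        String supp = "\uD800\uDC00";   // U+10000, a surrogate pair
        System.out.println("String.compareTo:  " + Integer.signum(bmp.compareTo(supp)));
        System.out.println("compareCodePoints: " + Integer.signum(compareCodePoints(bmp, supp)));
        System.out.println("compareUtf8:       " + Integer.signum(compareUtf8(
                bmp.getBytes(StandardCharsets.UTF_8),
                supp.getBytes(StandardCharsets.UTF_8))));
    }
}
```

For U+FF5E versus U+10000, String.compareTo puts U+10000 first (its lead
surrogate 0xD800 is below 0xFF5E), while both methods in the sketch put
U+FF5E first.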

 But then I worried about how much slower that code is than
 String.compareTo, and, I found a lot of places where innocent compareTo
 or < or > needed to be changed to this method call.  Field name
 comparisons would have to be fixed too.  Then for backwards
 compatibility all of these places that do comparisons would have to
 fallback to the Java way when interacting with an older segment.


Oh... older segments.  Yeah, I was speaking "theoretically".


Yeah.

 I think we can still explore this?  It just seemed way too big to
 glomm into the already-big changes in LUCENE-510.


Yeah, I was thinking of some of this more along the lines of Lucene 3.
A term could contain a byte array instead of a String.  A String
constructor would convert to UTF8 and then do lookups in the index
(simple byte comparisons, no charset encoding).  A byte constructor
for Term would also be allowed.  Things like TermEnumerators would
keep everything in bytes, the tii would be in bytes, etc.
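
A rough sketch of what such a byte-based term might look like follows.
ByteTerm is a hypothetical class, not Lucene's Term; it only illustrates
the two constructors and the byte-only comparison described above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, for illustration only; this is not Lucene's Term.
final class ByteTerm implements Comparable<ByteTerm> {
    private final String field;
    private final byte[] bytes;   // term text as UTF-8

    // Byte constructor: the caller already holds UTF-8 bytes.
    ByteTerm(String field, byte[] utf8) {
        this.field = field;
        this.bytes = utf8.clone();
    }

    // String constructor: encode to UTF-8 once, up front.
    ByteTerm(String field, String text) {
        this(field, text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode only when a caller actually needs the text.
    String text() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public int compareTo(ByteTerm other) {
        // Simplification: field names still use String.compareTo here.
        // As noted earlier in the thread, field name comparisons would
        // need the same treatment in a real change.
        int c = field.compareTo(other.field);
        if (c != 0) {
            return c;
        }
        // Simple unsigned byte comparison, no charset decoding.
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int d = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return bytes.length - other.bytes.length;
    }

    public static void main(String[] args) {
        ByteTerm a = new ByteTerm("body", "\uFF5E");
        ByteTerm b = new ByteTerm("body", "\uD800\uDC00");
        System.out.println(Integer.signum(a.compareTo(b)));
    }
}
```

Within a field, terms built this way sort in code point order, which is
the order an index holding UTF-8 bytes would use.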


Yup.

One could also think about ways to directly index bytes too.

Right, DocumentsWriter could hold its terms in byte[] and save time/space when terms are ascii.
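
As a small illustration of the space side of that claim (TermSize is a
hypothetical helper, and the figures count only character data, not object
overhead): an ASCII term takes one byte per character as UTF-8 and two as
a char[], while other scripts can come out the other way.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class TermSize {

    // Bytes needed to hold the term as UTF-8.
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to hold the term as UTF-16 code units (a Java char[]).
    static int utf16Bytes(String term) {
        return term.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "lucene";
        String cjk = "\u4E2D";   // a single CJK character
        System.out.println(ascii + ": utf8=" + utf8Bytes(ascii) + " utf16=" + utf16Bytes(ascii));
        System.out.println(cjk + ": utf8=" + utf8Bytes(cjk) + " utf16=" + utf16Bytes(cjk));
    }
}
```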

Is it all worth it?  I really don't know.

Right, that's where I started to wonder. It felt very much like I
was "going against the grain of Java" as the changes started to pile
up ...


Mike
