Marvin Humphrey wrote:
> Michael McCandless resolved LUCENE-510.
> Congratulations. :)
Thanks. I didn't quite realize what I was getting myself into when I
said "yes" on that issue!
> When I wrote my initial patch, I saw a performance degradation of
> c. 30% in my indexing benchmarks.
I think it was 20%.
> Repeated reallocation was presumably one culprit: when length in
> Java chars is stored in the index, you only need to allocate once,
> whereas when reading in UTF-8, you can't know just how much memory
> you need until the read completes. Furthermore, at write-time, you
> can't look at something composed of 16-bit chars and know what the
> byte-length of its UTF-8 representation will be without pre-scanning.
Right, avoiding allocations was pretty much it (I think String's
getBytes method accounted for most of the slowdown). I was also
able to eliminate another per-term scan we were doing in
DocumentsWriter and fold it into the conversion.
I ended up creating custom conversion methods (UTF8toUTF16 and
vice versa) that convert into a re-used byte[] or char[], which
grow as needed, and then I just bulk-write the bytes. I think this
is not much slower than before (modified UTF-8), since that code
also had to go character by character with ifs inside the inner
loop.
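
The shape of it is roughly the sketch below (my illustration, not
the actual Lucene code; the class name Utf8Buffer is made up): walk
the chars once, append UTF-8 bytes into a buffer that is kept across
calls and grown on demand, then hand bytes/length to the output in
one bulk write. The UTF8toUTF16 direction works the same way with a
re-used char[].

    // Minimal sketch: encode UTF-16 chars into a re-used, growable
    // byte[] instead of calling String.getBytes("UTF-8") per term.
    // Unpaired surrogates aren't handled specially here.
    final class Utf8Buffer {
      byte[] bytes = new byte[16];
      int length;

      void encode(char[] src, int offset, int len) {
        length = 0;
        final int end = offset + len;
        for (int i = offset; i < end; ) {
          // A single code point needs at most 4 bytes; grow up front.
          if (length + 4 > bytes.length) {
            byte[] grown = new byte[Math.max(bytes.length * 2, length + 4)];
            System.arraycopy(bytes, 0, grown, 0, length);
            bytes = grown;
          }
          final int cp = Character.codePointAt(src, i, end);
          i += Character.charCount(cp);        // 2 chars for a surrogate pair
          if (cp < 0x80) {                     // 1 byte
            bytes[length++] = (byte) cp;
          } else if (cp < 0x800) {             // 2 bytes
            bytes[length++] = (byte) (0xC0 | (cp >> 6));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          } else if (cp < 0x10000) {           // 3 bytes
            bytes[length++] = (byte) (0xE0 | (cp >> 12));
            bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          } else {                             // 4 bytes
            bytes[length++] = (byte) (0xF0 | (cp >> 18));
            bytes[length++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          }
        }
      }
    }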
I'm less happy with the 11% slowdown on TermEnum, and that's even
with the optimization to incrementally decode only the "new" UTF-8
bytes as we read the changed suffix of each term, reusing the
already-decoded UTF-16 chars from the previous term. This will
slow down populating a FieldCache, which is already slow, but
LUCENE-831 and LUCENE-1231 should fix that.
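
Schematically the term reading looks something like the sketch
below (my own simplification, not the real code; it assumes
well-formed UTF-8 and that the shared prefix ends on a complete
sequence, which the real code can't assume, and buffer growth is
omitted). Each term arrives as a shared-prefix byte count plus
suffix bytes, so only the suffix gets decoded:

    // Illustrative only: decode just the new suffix bytes of each
    // prefix-compressed term, reusing the chars already decoded for
    // the previous term.
    final class IncrementalTermDecoder {
      private final byte[] termBytes = new byte[256];  // UTF-8 of current term
      private final char[] termChars = new char[256];  // decoded UTF-16 chars
      private final int[] charsAtByte = new int[257];  // chars produced by first i bytes
      private int byteLen, charLen;

      void nextTerm(int sharedBytes, byte[] suffix, int suffixLen) {
        System.arraycopy(suffix, 0, termBytes, sharedBytes, suffixLen);
        byteLen = sharedBytes + suffixLen;

        int b = sharedBytes;
        int c = charsAtByte[sharedBytes]; // chars of the shared prefix are reused
        while (b < byteLen) {
          final int lead = termBytes[b] & 0xFF;
          int cp, size;
          if (lead < 0x80)      { cp = lead;        size = 1; }
          else if (lead < 0xE0) { cp = lead & 0x1F; size = 2; }
          else if (lead < 0xF0) { cp = lead & 0x0F; size = 3; }
          else                  { cp = lead & 0x07; size = 4; }
          for (int k = 1; k < size; k++) {
            cp = (cp << 6) | (termBytes[b + k] & 0x3F);
          }
          b += size;
          if (cp < 0x10000) {
            termChars[c++] = (char) cp;
          } else {                          // supplementary: surrogate pair
            cp -= 0x10000;
            termChars[c++] = (char) (0xD800 + (cp >> 10));
            termChars[c++] = (char) (0xDC00 + (cp & 0x3FF));
          }
          charsAtByte[b] = c;               // remember mapping for the next term
        }
        charLen = c;
      }
    }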
> Are the string diffs and comparisons now performed against raw
> bytes, so that fewer conversions are needed?
Alas, not yet: Lucene still uses UTF-16 Java chars internally. The
conversion to UTF-8 happens "at the last minute" when writing, and
"immediately" when reading.
I started exploring keeping UTF-8 bytes further in, but it quickly
got messy because it would require changing how the term infos are
sorted, to Unicode code point order. Comparing UTF-8 bytes (as
unsigned values) gives the same order as comparing Unicode code
points, which is nice. But comparing UTF-16 code units is almost,
but not quite, the same: surrogate pairs for supplementary
characters sort below the upper part of the BMP, even though their
code points are larger. So suddenly, everywhere a string comparison
takes place, I had to assess whether that comparison should be by
Unicode code point and, if so, call our own method for doing it.
It quickly became a "big" project, so I ran back to sorting by
UTF-16 value.
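
A tiny standalone example of the discrepancy (nothing
Lucene-specific here): U+FF61 is in the BMP, while U+10000 becomes
the surrogate pair 0xD800 0xDC00 in UTF-16, so the two orders
disagree about which comes first.

    public class SortOrderDemo {
      public static void main(String[] args) throws Exception {
        String bmp  = "\uFF61";                               // U+FF61
        String supp = new String(Character.toChars(0x10000)); // U+10000 -> \uD800\uDC00

        // UTF-16 code unit order (String.compareTo): 0xFF61 > 0xD800,
        // so the supplementary character sorts first.
        System.out.println(bmp.compareTo(supp));              // positive

        // Unicode code point order: 0xFF61 < 0x10000.
        System.out.println(bmp.codePointAt(0) - supp.codePointAt(0)); // negative

        // Unsigned comparison of the UTF-8 bytes agrees with code point
        // order: EF BD A1 vs F0 90 80 80, and 0xEF < 0xF0.
        byte[] a = bmp.getBytes("UTF-8");
        byte[] b = supp.getBytes("UTF-8");
        System.out.println((a[0] & 0xFF) - (b[0] & 0xFF));    // negative
      }
    }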
Mike