Marvin Humphrey wrote:

> Michael McCandless resolved LUCENE-510.
>
> Congratulations.  :)

Thanks. I didn't quite realize what I was getting myself into when I said "yes" on that issue!

> When I wrote my initial patch, I saw a performance degradation of c. 30% in my indexing benchmarks.

I think it was 20%.

> Repeated reallocation was presumably one culprit: when length in Java chars is stored in the index, you only need to allocate once, whereas when reading in UTF-8, you can't know just how much memory you need until the read completes. Furthermore, at write-time, you can't look at something composed of 16-bit chars and know what the byte-length of its UTF-8 representation will be without pre-scanning.

Right, avoiding those allocations was pretty much it (String's getBytes method was most of the slowdown, I think). I was also able to eliminate another per-term scan we were doing in DocumentsWriter and fold it into the conversion.
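
To make the pre-scanning point concrete, that extra write-time pass has roughly this shape (a simplified sketch for illustration, not actual code; the utf8Length name is made up here): a full walk over the UTF-16 chars whose only job is to learn how many UTF-8 bytes to allocate.

  // Sketch of a per-term pre-scan: one pass over the UTF-16 chars only to
  // compute the UTF-8 byte length, so a buffer can be sized before a second
  // pass actually encodes.
  static int utf8Length(char[] chars, int offset, int len) {
    int bytes = 0;
    for (int i = offset; i < offset + len; i++) {
      char c = chars[i];
      if (c < 0x80) {
        bytes += 1;                              // ASCII
      } else if (c < 0x800) {
        bytes += 2;                              // U+0080..U+07FF
      } else if (Character.isHighSurrogate(c)
                 && i + 1 < offset + len
                 && Character.isLowSurrogate(chars[i + 1])) {
        bytes += 4;                              // supplementary code point
        i++;                                     // consume the low surrogate too
      } else {
        bytes += 3;                              // rest of the BMP
      }
    }
    return bytes;
  }

Converting straight into a re-used buffer makes a separate sizing pass like this unnecessary.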

I ended up creating custom conversion methods (UTF8toUTF16 and vice versa) to do this conversion into a re-used byte[] or char[], which grow as needed, and then I just bulk-write the bytes. I think this is not much slower than before (modified UTF-8), since that code also had to go character by character with ifs inside its inner loop.
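
Roughly, the write-side conversion has this shape (a simplified sketch of the idea, not the committed code; UTF8Scratch and its fields are illustrative names only):

  // Simplified sketch of UTF-16 -> UTF-8 conversion into a re-used buffer.
  // The byte[] only ever grows, so steady-state indexing does no per-term
  // allocation; the caller bulk-writes bytes[0..length) afterwards.
  final class UTF8Scratch {
    byte[] bytes = new byte[128];   // re-used across terms
    int length;

    void encode(char[] chars, int offset, int len) {
      // Worst case is 3 bytes per UTF-16 unit (a surrogate pair is 2 units
      // -> 4 bytes), so one up-front size check covers the whole term.
      if (bytes.length < 3 * len) {
        bytes = new byte[3 * len];
      }
      int out = 0;
      int end = offset + len;
      for (int i = offset; i < end; i++) {
        int code = chars[i];
        if (code < 0x80) {
          bytes[out++] = (byte) code;
        } else if (code < 0x800) {
          bytes[out++] = (byte) (0xC0 | (code >> 6));
          bytes[out++] = (byte) (0x80 | (code & 0x3F));
        } else if (code < 0xD800 || code > 0xDFFF) {
          bytes[out++] = (byte) (0xE0 | (code >> 12));
          bytes[out++] = (byte) (0x80 | ((code >> 6) & 0x3F));
          bytes[out++] = (byte) (0x80 | (code & 0x3F));
        } else if (Character.isHighSurrogate(chars[i]) && i + 1 < end
                   && Character.isLowSurrogate(chars[i + 1])) {
          int cp = Character.toCodePoint(chars[i], chars[++i]);  // one 4-byte sequence
          bytes[out++] = (byte) (0xF0 | (cp >> 18));
          bytes[out++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
          bytes[out++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
          bytes[out++] = (byte) (0x80 | (cp & 0x3F));
        } else {
          // unpaired surrogate: substitute U+FFFD
          bytes[out++] = (byte) 0xEF;
          bytes[out++] = (byte) 0xBF;
          bytes[out++] = (byte) 0xBD;
        }
      }
      length = out;
    }
  }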

I'm less happy with the 11% slowdown on TermEnum, and that's even with the optimization to incrementally decode only the "new" UTF-8 bytes as we read the changed suffix of each term, reusing the already-decoded UTF-16 chars from the previous term. This will slow down populating a FieldCache, which is already slow, but LUCENE-831 and LUCENE-1231 should fix that.
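
The incremental decode is easier to see stripped down. Here's a rough sketch of the idea only, not the real code; to stay short it assumes each stored prefix length lands on a UTF-8 sequence boundary and the bytes are well formed.

  // Decode only the changed suffix of each term, re-using the chars already
  // decoded for the shared prefix.  charsUpTo[b] remembers how many UTF-16
  // chars the first b UTF-8 bytes decoded to, so the next term can pick up
  // from there instead of re-decoding the whole term.
  final class IncrementalTermDecoder {
    private byte[] utf8 = new byte[128];
    private char[] utf16 = new char[128];          // chars never outnumber bytes
    private int[] charsUpTo = new int[129];
    int byteLength, charLength;

    void nextTerm(int prefixBytes, byte[] suffix, int suffixLen) {
      byteLength = prefixBytes + suffixLen;
      grow(byteLength);
      System.arraycopy(suffix, 0, utf8, prefixBytes, suffixLen);

      int charUpto = charsUpTo[prefixBytes];       // already decoded for the prefix
      int byteUpto = prefixBytes;
      while (byteUpto < byteLength) {              // decode only the new bytes
        int b = utf8[byteUpto] & 0xFF;
        int seqLen = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        charUpto += Character.toChars(codePointAt(byteUpto, seqLen), utf16, charUpto);
        byteUpto += seqLen;
        charsUpTo[byteUpto] = charUpto;            // remember for the next term
      }
      charLength = charUpto;
    }

    private int codePointAt(int pos, int seqLen) {
      int b0 = utf8[pos] & 0xFF;
      if (seqLen == 1) return b0;
      if (seqLen == 2) return ((b0 & 0x1F) << 6) | (utf8[pos + 1] & 0x3F);
      if (seqLen == 3) return ((b0 & 0x0F) << 12) | ((utf8[pos + 1] & 0x3F) << 6)
                              | (utf8[pos + 2] & 0x3F);
      return ((b0 & 0x07) << 18) | ((utf8[pos + 1] & 0x3F) << 12)
             | ((utf8[pos + 2] & 0x3F) << 6) | (utf8[pos + 3] & 0x3F);
    }

    private void grow(int minBytes) {
      if (utf8.length < minBytes) {
        utf8 = java.util.Arrays.copyOf(utf8, 2 * minBytes);
        utf16 = java.util.Arrays.copyOf(utf16, 2 * minBytes);
        charsUpTo = java.util.Arrays.copyOf(charsUpTo, 2 * minBytes + 1);
      }
    }
  }

The key is the charsUpTo bookkeeping: because consecutive terms share a byte prefix, the chars decoded for that prefix never have to be recomputed.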

> Are the string diffs and comparisons now performed against raw bytes, so that fewer conversions are needed?

Alas, not yet: Lucene still uses UTF-16 Java chars internally. The conversion to UTF-8 happens "at the last minute" when writing, and "immediately" when reading.

I started exploring keeping UTF-8 bytes further in, but it quickly got messy because it would require changing how the term infos are sorted, to Unicode code point order. Comparing UTF-8 bytes gives the same order as comparing Unicode code points, which is nice. But comparing UTF-16 values is almost, but not quite, the same. So suddenly, everywhere a string comparison takes place, I had to assess whether that comparison should be by Unicode code point and call our own method for doing so. It quickly became a "big" project, so I ran back to sorting by UTF-16 value.
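
A tiny illustration of the mismatch (example only, not Lucene code): take the BMP character U+FA00 and the supplementary character U+10000. By code point, and therefore by UTF-8 bytes, U+FA00 sorts first, but String.compareTo compares raw UTF-16 units, and the high surrogate 0xD800 sorts below 0xFA00, so the supplementary character jumps ahead.

  // Illustration only: UTF-16 unit order vs. Unicode code point order.
  public class CodePointOrderDemo {
    public static void main(String[] args) {
      String bmp  = "\uFA00";          // U+FA00 (3 UTF-8 bytes: EF A8 80)
      String supp = "\uD800\uDC00";    // U+10000 (4 UTF-8 bytes: F0 90 80 80)

      // UTF-16 unit comparison: 0xFA00 > 0xD800, so bmp sorts *after* supp...
      System.out.println(bmp.compareTo(supp) > 0);                   // true
      // ...but by code point (and by UTF-8 byte order) bmp sorts *before* supp.
      System.out.println(bmp.codePointAt(0) < supp.codePointAt(0));  // true
    }
  }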

Mike
