Marvin Humphrey wrote:
> Michael McCandless resolved LUCENE-510.
> Congratulations. :)
Thanks. I didn't quite realize what I was getting myself into when I
said "yes" on that issue!
> When I wrote my initial patch, I saw a performance degradation of
> c. 30% in my indexing benchmarks.
I think it was 20%.
> Repeated reallocation was presumably one culprit: when length in
> Java chars is stored in the index, you only need to allocate once,
> whereas when reading in UTF-8, you can't know just how much memory
> you need until the read completes. Furthermore, at write-time, you
> can't look at something composed of 16-bit chars and know what the
> byte-length of its UTF-8 representation will be without pre-scanning.
Right, avoiding allocations was pretty much it (I think String's
getBytes method accounted for most of the slowdown). I was also
able to eliminate another per-term scan we were doing in
DocumentsWriter and fold it into the conversion.
I ended up creating custom conversion methods (UTF8toUTF16 and
vice versa) that convert into a re-used byte[] or char[], which
grow as needed, and then I just bulk-write the bytes. I think this
is not much slower than before (modified UTF-8), since that code
also had to go character by character with ifs inside the inner
loop.
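
The shape of it is roughly the sketch below (my illustration, not
the actual Lucene code; the class name Utf8Buffer is made up): walk
the chars once, append UTF-8 bytes into a buffer that is kept across
calls and grown on demand, then hand bytes/length to the output in
one bulk write. The UTF8toUTF16 direction works the same way with a
re-used char[].

    // Minimal sketch: encode UTF-16 chars into a re-used, growable
    // byte[] instead of calling String.getBytes("UTF-8") per term.
    // Unpaired surrogates aren't handled specially here.
    final class Utf8Buffer {
      byte[] bytes = new byte[16];
      int length;

      void encode(char[] src, int offset, int len) {
        length = 0;
        final int end = offset + len;
        for (int i = offset; i < end; ) {
          // A single code point needs at most 4 bytes; grow up front.
          if (length + 4 > bytes.length) {
            byte[] grown = new byte[Math.max(bytes.length * 2, length + 4)];
            System.arraycopy(bytes, 0, grown, 0, length);
            bytes = grown;
          }
          final int cp = Character.codePointAt(src, i, end);
          i += Character.charCount(cp);        // 2 chars for a surrogate pair
          if (cp < 0x80) {                     // 1 byte
            bytes[length++] = (byte) cp;
          } else if (cp < 0x800) {             // 2 bytes
            bytes[length++] = (byte) (0xC0 | (cp >> 6));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          } else if (cp < 0x10000) {           // 3 bytes
            bytes[length++] = (byte) (0xE0 | (cp >> 12));
            bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          } else {                             // 4 bytes
            bytes[length++] = (byte) (0xF0 | (cp >> 18));
            bytes[length++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            bytes[length++] = (byte) (0x80 | (cp & 0x3F));
          }
        }
      }
    }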
I'm less happy with the 11% slowdown on TermEnum, and that's even
with the optimization to incrementally decode only the "new" UTF-8
bytes as we read the changed suffix of each term, reusing the
already-decoded UTF-16 chars from the previous term. This will
slow down populating a FieldCache, which is already slow, but
LUCENE-831 and LUCENE-1231 should fix that.
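
Schematically the term reading looks something like the sketch
below (my own simplification, not the real code; it assumes
well-formed UTF-8 and that the shared prefix ends on a complete
sequence, which the real code can't assume, and buffer growth is
omitted). Each term arrives as a shared-prefix byte count plus
suffix bytes, so only the suffix gets decoded:

    // Illustrative only: decode just the new suffix bytes of each
    // prefix-compressed term, reusing the chars already decoded for
    // the previous term.
    final class IncrementalTermDecoder {
      private final byte[] termBytes = new byte[256];  // UTF-8 of current term
      private final char[] termChars = new char[256];  // decoded UTF-16 chars
      private final int[] charsAtByte = new int[257];  // chars produced by first i bytes
      private int byteLen, charLen;

      void nextTerm(int sharedBytes, byte[] suffix, int suffixLen) {
        System.arraycopy(suffix, 0, termBytes, sharedBytes, suffixLen);
        byteLen = sharedBytes + suffixLen;

        int b = sharedBytes;
        int c = charsAtByte[sharedBytes]; // chars of the shared prefix are reused
        while (b < byteLen) {
          final int lead = termBytes[b] & 0xFF;
          int cp, size;
          if (lead < 0x80)      { cp = lead;        size = 1; }
          else if (lead < 0xE0) { cp = lead & 0x1F; size = 2; }
          else if (lead < 0xF0) { cp = lead & 0x0F; size = 3; }
          else                  { cp = lead & 0x07; size = 4; }
          for (int k = 1; k < size; k++) {
            cp = (cp << 6) | (termBytes[b + k] & 0x3F);
          }
          b += size;
          if (cp < 0x10000) {
            termChars[c++] = (char) cp;
          } else {                          // supplementary: surrogate pair
            cp -= 0x10000;
            termChars[c++] = (char) (0xD800 + (cp >> 10));
            termChars[c++] = (char) (0xDC00 + (cp & 0x3FF));
          }
          charsAtByte[b] = c;               // remember mapping for the next term
        }
        charLen = c;
      }
    }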
> Are the string diffs and comparisons now performed against raw
> bytes, so that fewer conversions are needed?
Alas, not yet: Lucene still uses UTF-16 Java chars internally. The
conversion to UTF-8 happens "at the last minute" when writing, and
"immediately" when reading.
I started exploring keeping UTF-8 bytes further in, but it quickly
got messy because it would require changing how the term infos are
sorted, to Unicode code point order. Comparing UTF-8 bytes (as
unsigned values) gives the same order as comparing Unicode code
points, which is nice. But comparing UTF-16 code units is almost,
but not quite, the same: surrogate pairs for supplementary
characters sort below the upper part of the BMP, even though their
code points are larger. So suddenly, everywhere a string comparison
takes place, I had to assess whether that comparison should be by
Unicode code point and, if so, call our own method for doing it.
It quickly became a "big" project, so I ran back to sorting by
UTF-16 value.
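
A tiny standalone example of the discrepancy (nothing
Lucene-specific here): U+FF61 is in the BMP, while U+10000 becomes
the surrogate pair 0xD800 0xDC00 in UTF-16, so the two orders
disagree about which comes first.

    public class SortOrderDemo {
      public static void main(String[] args) throws Exception {
        String bmp  = "\uFF61";                               // U+FF61
        String supp = new String(Character.toChars(0x10000)); // U+10000 -> \uD800\uDC00

        // UTF-16 code unit order (String.compareTo): 0xFF61 > 0xD800,
        // so the supplementary character sorts first.
        System.out.println(bmp.compareTo(supp));              // positive

        // Unicode code point order: 0xFF61 < 0x10000.
        System.out.println(bmp.codePointAt(0) - supp.codePointAt(0)); // negative

        // Unsigned comparison of the UTF-8 bytes agrees with code point
        // order: EF BD A1 vs F0 90 80 80, and 0xEF < 0xF0.
        byte[] a = bmp.getBytes("UTF-8");
        byte[] b = supp.getBytes("UTF-8");
        System.out.println((a[0] & 0xFF) - (b[0] & 0xFF));    // negative
      }
    }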
Mike