On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:
> All of the JDK source is available via download from Sun.
Thanks. I believe the UTF-8 coding algos can be found in...
j2se > src > share > classes > sun > nio > cs > UTF_8.java
It looks like the translator methods have fairly high per-iteration
overhead, since they have to keep track of the member variables of
ByteBuffer and CharBuffer objects and prepare to return result objects
on each loop iteration. Also, they have robust error-checking for
malformed source data, which Lucene traditionally has not had. The
algo below my sig should be faster.
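For comparison, here is a minimal sketch of the general-purpose JDK route, going through a CharsetDecoder with strict error reporting. The class and method names are hypothetical (they are not from Lucene or from UTF_8.java), and it uses StandardCharsets, which needs a newer JDK than the one discussed in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class JdkUtf8Baseline {
    // Decodes bytes[start, start + length) with the JDK's own decoder.
    // Malformed input is reported rather than silently replaced.
    static String decode(byte[] bytes, int start, int length) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Wraps the slice in a ByteBuffer and allocates a fresh
            // CharBuffer for the result on every call.
            return decoder.decode(ByteBuffer.wrap(bytes, start, length)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed UTF-8", e);
        }
    }
}
```

This is the convenience path: it validates the input and allocates buffer objects per call, which is the kind of overhead a hand-rolled loop over plain arrays tries to avoid.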
I wrote...
> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.
Ok, this time we made a little headway. Indexing is down from 20%
slower to around 10% slower than the current implementation. But I
don't see how I'm going to get it any faster. There's maybe one
conditional in FieldsReader that can be simplified.
There's another downside to the way I'm implementing this right now.
The byteBuf and charBuf have to be kept somewhere. Currently, I'm
allocating a ByteBuffer for each TermInfosWriter and a charBuf for
each TermBuffer. That's something of a memory hit, though it's hard
to say exactly how much. IndexInput and IndexOutput are still using
the Sun methods -- when I gave them Buffers, they slowed down.
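To make the memory trade-off concrete, here is a rough sketch of a reusable scratch buffer of the kind each owner object would have to hold. ScratchChars and its methods are hypothetical names, not Lucene classes, and the doubling policy is an assumption.

```java
// Hypothetical per-owner scratch buffer, for illustration only.
final class ScratchChars {
    private char[] chars = new char[16];

    // Returns an array with room for at least `needed` chars. The
    // array is reused across calls and only replaced when too small.
    char[] ensure(int needed) {
        if (needed > chars.length) {
            chars = new char[Math.max(needed, chars.length * 2)];
        }
        return chars;
    }

    // Current size; it never shrinks, so it stays at the high-water mark.
    int capacity() {
        return chars.length;
    }
}
```

Each instance stays at the size of the largest string it has ever handled, so the total cost depends on how many owner objects are alive at once and how long their longest strings are.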
I've got one more idea... time to try overriding readString and
writeString in BufferedIndexInput and BufferedIndexOutput, to take
advantage of buffers that are already there.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
//----------------------------------------------------------------
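// Requires java.nio.CharBuffer. charBuf must be array-backed (e.g. from
// CharBuffer.allocate). Assumes well-formed UTF-8: no malformed-input checks.
public static final CharBuffer utf8ToChars (
    byte[] bytes, int start, int length, CharBuffer charBuf) {
  int i = start;                  // read index into bytes
  int j = 0;                      // write index into chars
  final int end = start + length;
  char[] chars = charBuf.array();
  try {
    while (i < end) {
      byte b = bytes[i++];
      // The table maps a lead byte to its number of trailing bytes.
      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
        case 0:
          chars[j++] = (char)(b & 0x7F);
          break;
        case 1:
          chars[j++] = (char)(((b & 0x1F) << 6)
                            | (bytes[i++] & 0x3F));
          break;
        case 2:
          chars[j++] = (char)(((b & 0x0F) << 12)
                            | ((bytes[i++] & 0x3F) << 6)
                            | (bytes[i++] & 0x3F));
          break;
        case 3:
          // 4-byte sequence: decode the code point, then split it
          // into a UTF-16 surrogate pair. The lead byte carries only
          // 3 payload bits, hence the 0x07 mask.
          int utf32 = (((b & 0x07) << 18)
                     | ((bytes[i++] & 0x3F) << 12)
                     | ((bytes[i++] & 0x3F) << 6)
                     | (bytes[i++] & 0x3F));
          chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
          chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
          break;
      }
    }
  }
  catch (ArrayIndexOutOfBoundsException e) {
    // chars was too small: estimate the size needed from the
    // chars-per-byte ratio so far (plus 10%) and retry from the start.
    // A truncated trailing sequence in bytes would also land here.
    float bytesProcessed = (float)(i - start);
    float charsPerByte = (j / bytesProcessed) * 1.1f;
    float bytesLeft = length - bytesProcessed;
    float targetSize = (float)chars.length + charsPerByte * bytesLeft + 1.0f;
    return utf8ToChars(bytes, start, length,
        CharBuffer.allocate((int)targetSize));
  }
  charBuf.position(j);
  return charBuf;
}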
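The method above refers to a TRAILING_BYTES_FOR_UTF8 table that isn't shown. A minimal sketch of how such a table could be built follows. The layout (index is the unsigned lead byte, value is the number of trailing bytes) is inferred from how the switch uses it, and the Utf8Tables class and sequenceLength helper are hypothetical.

```java
// Hypothetical holder class, for illustration only.
final class Utf8Tables {
    // Index: unsigned lead byte. Value: number of trailing bytes.
    static final byte[] TRAILING_BYTES_FOR_UTF8 = new byte[256];

    static {
        for (int b = 0; b < 256; b++) {
            byte trailing;
            if (b < 0xC0) {
                trailing = 0;   // ASCII; stray continuation bytes also land here
            } else if (b < 0xE0) {
                trailing = 1;   // 110xxxxx
            } else if (b < 0xF0) {
                trailing = 2;   // 1110xxxx
            } else {
                trailing = 3;   // 11110xxx; invalid leads 0xF8-0xFF are not rejected
            }
            TRAILING_BYTES_FOR_UTF8[b] = trailing;
        }
    }

    // Total length in bytes of the sequence that starts with this lead byte.
    static int sequenceLength(byte lead) {
        return 1 + TRAILING_BYTES_FOR_UTF8[lead & 0xFF];
    }
}
```

Like the decoding loop, this does no validation: continuation bytes in lead position and out-of-range lead bytes are passed through rather than rejected.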