On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

All of the JDK source is available via download from Sun.

Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high per-iteration overhead, since they have to keep track of the member variables of the ByteBuffer and CharBuffer objects and prepare to return a result object on each loop iteration. Also, they have robust error-checking for malformed source data, which Lucene traditionally has not. The algo below my sig should be faster.
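For comparison, here is a minimal sketch of decoding through the public java.nio.charset API, which is the layer that sits on top of the Sun translator code. It only illustrates the buffer and result objects mentioned above; it does not show the internals of UTF_8.java, and the class name DecoderApiSketch is made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical illustration class, not part of Lucene or the JDK.
final class DecoderApiSketch {

  // Decodes a UTF-8 byte range to a String via CharsetDecoder.
  static String decode(byte[] bytes, int start, int length) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    ByteBuffer in = ByteBuffer.wrap(bytes, start, length);
    // Well-formed UTF-8 never decodes to more UTF-16 chars than it has bytes,
    // so a buffer of `length` chars cannot overflow on valid input.
    CharBuffer out = CharBuffer.allocate(length);

    // The decoder reports its outcome as a CoderResult object and tracks
    // position/limit on both buffers as it goes.
    CoderResult result = decoder.decode(in, out, true);
    if (result.isError()) {
      throw new IllegalArgumentException("malformed UTF-8: " + result);
    }
    decoder.flush(out);

    out.flip();             // switch the buffer from writing to reading
    return out.toString();  // chars between position and limit
  }
}
```

The malformed-input check here is the kind of error handling the hand-rolled loop skips.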

I wrote...

So my next step is to write a utf8ToString method that's as efficient
as I can make it.

Ok, this time we made a little headway. Indexing is now around 10% slower than the current implementation, down from 20% slower. But I don't see how I'm going to get it any faster. There's maybe one conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now. The byteBuf and charBuf have to be kept somewhere. Currently, I'm allocating a ByteBuffer for each TermInfosWriter and a CharBuffer for each TermBuffer. That's something of a memory hit, though it's hard to say exactly how much. IndexInput and IndexOutput are still using the Sun methods -- when I gave them Buffers, they slowed down.

I've got one more idea... time to try overriding readString and writeString in BufferedIndexInput and BufferedIndexOutput, to take advantage of buffers that are already there.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

//----------------------------------------------------------------

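  // Decodes `length` bytes of UTF-8 starting at bytes[start] into the array
  // backing charBuf. Malformed input is not checked. If the array turns out
  // to be too small, a larger CharBuffer is allocated and returned instead.
  public static final CharBuffer utf8ToChars (
        byte[] bytes, int start, int length, CharBuffer charBuf) {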
    int i = start;
    int j = 0;
    final int end = start + length;
    char[] chars = charBuf.array();
    try {
      while (i < end) {
        byte b = bytes[i++];
        switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
          case 0:
            chars[j++] = (char)(b & 0x7F);
            break;
          case 1:
            chars[j++] = (char)(((b & 0x1F) << 6)
              | (bytes[i++] & 0x3F));
            break;
          case 2:
            chars[j++] = (char)(((b & 0x0F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            break;
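          case 3:
            // Four-byte sequence: lead byte 11110xxx carries 3 payload bits.
            int utf32 = (((b & 0x07) << 18)
              | ((bytes[i++] & 0x3F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            // Split into a UTF-16 surrogate pair.
            // 0xD7C0 is 0xD800 - (0x10000 >> 10).
            chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
            chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
            break;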
        }
      }
    }
    catch (ArrayIndexOutOfBoundsException e) {
      float bytesProcessed = (float)(i - start);
      float bytesPerChar = (j / bytesProcessed) * 1.1f;

      float bytesLeft = length - bytesProcessed;
      float targetSize = (float)chars.length + bytesPerChar * bytesLeft + 1.0f;
      return utf8ToChars(bytes, start, length,
        CharBuffer.allocate((int)targetSize));
    }
    charBuf.position(j);
    return charBuf;
  }
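
The method above reads TRAILING_BYTES_FOR_UTF8 without showing how it is defined. Below is a minimal sketch of how such a table might be built, assuming it maps each unsigned lead byte to the number of continuation bytes that follow it. The layout is a guess from how the switch uses it, and the class name Utf8TableSketch and the helper charsNeeded are hypothetical, not part of the original code.

```java
// Hypothetical illustration class; the table layout is an assumption.
final class Utf8TableSketch {

  // Index: unsigned lead byte. Value: number of continuation bytes.
  static final int[] TRAILING_BYTES_FOR_UTF8 = new int[256];
  static {
    for (int b = 0; b < 256; b++) {
      if (b < 0xC0) {
        // ASCII. Stray continuation bytes (0x80-0xBF) are not rejected.
        TRAILING_BYTES_FOR_UTF8[b] = 0;
      } else if (b < 0xE0) {
        TRAILING_BYTES_FOR_UTF8[b] = 1;  // 110xxxxx
      } else if (b < 0xF0) {
        TRAILING_BYTES_FOR_UTF8[b] = 2;  // 1110xxxx
      } else {
        // 11110xxx. 0xF8 and up never occur in well-formed UTF-8;
        // this sketch does not validate them.
        TRAILING_BYTES_FOR_UTF8[b] = 3;
      }
    }
  }

  // Counts the UTF-16 chars that a well-formed UTF-8 run decodes to, so a
  // caller could size the output array up front instead of retrying.
  static int charsNeeded(byte[] bytes, int start, int length) {
    int chars = 0;
    int i = start;
    final int end = start + length;
    while (i < end) {
      int trailing = TRAILING_BYTES_FOR_UTF8[bytes[i] & 0xFF];
      chars += (trailing == 3) ? 2 : 1;  // 4-byte sequences become a surrogate pair
      i += 1 + trailing;
    }
    return chars;
  }
}
```

A counting pass like charsNeeded costs an extra trip over the bytes, so whether it beats the catch-and-retry sizing above would have to be measured.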


