Re: bytecount as String and prefix length

Marvin Humphrey Mon, 31 Oct 2005 22:36:36 -0800


On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

All of the JDK source is available via download from Sun.


Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high loop overheads, since they have to keep track of the member variables of ByteBuffer and CharBuffer objects and prepare to return result objects on each loop iter. Also, they have robust error-checking for malformed source data, which Lucene traditionally has not. The algo below my sig should be faster.
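For comparison, here is roughly what going through the general-purpose JDK decoder looks like from the caller's side. This is only a sketch of the public java.nio.charset API, not the internals of UTF_8.java. The class name DecoderBaseline and the choice to rethrow as IllegalArgumentException are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderBaseline {
  // Decode a slice of a byte array with the JDK's UTF-8 decoder.
  // (Hypothetical helper, not part of Lucene or the JDK.)
  public static String decode(byte[] bytes, int start, int length) {
    // REPORT makes malformed input raise an error instead of being
    // silently replaced; this is the error checking mentioned above.
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      // Each call wraps the input in a ByteBuffer and allocates a
      // fresh CharBuffer for the result.
      return decoder.decode(ByteBuffer.wrap(bytes, start, length)).toString();
    } catch (CharacterCodingException e) {
      throw new IllegalArgumentException("malformed UTF-8 input", e);
    }
  }
}
```

Every call here creates a decoder, a ByteBuffer wrapper, a CharBuffer and a String, which is the kind of per-call cost a hand-rolled loop over raw arrays avoids.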


I wrote...

So my next step is to write a utf8ToString method that's as efficient
as I can make it.

Ok, this time we made a little headway. We're down from 20% slower to around 10% slower indexing than the current implementation. But I don't see how I'm going to get it any faster. There's maybe one conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now. The byteBuf and charBuf have to be kept somewhere. Currently, I'm allocating a ByteBuffer for each TermInfosWriter and a charBuf for each TermBuffer. That's something of a memory hit, though it's hard to say exactly how much. IndexInput and IndexOutput are still using the Sun methods -- when I gave them Buffers, they slowed down.

I've got one more idea... time to try overriding readString and writeString in BufferedIndexInput and BufferedIndexOutput, to take advantage of buffers that are already there.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

//----------------------------------------------------------------

  public static final CharBuffer utf8ToChars (
        byte[] bytes, int start, int length, CharBuffer charBuf) {
    int i = start;
    int j = 0;
    final int end = start + length;
    char[] chars = charBuf.array();
    try {
      while (i < end) {
        byte b = bytes[i++];
        switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
          case 0:
            chars[j++] = (char)(b & 0x7F);
            break;
          case 1:
            chars[j++] = (char)(((b & 0x1F) << 6)
              | (bytes[i++] & 0x3F));
            break;
          case 2:
            chars[j++] = (char)(((b & 0x0F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            break;
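          case 3:
            // 4-byte sequence: the lead byte 11110xxx carries 3 payload bits,
            // followed by exactly three continuation bytes.
            int utf32 = (((b & 0x07) << 18)
              | ((bytes[i++] & 0x3F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            // Split into a UTF-16 surrogate pair.
            // 0xD7C0 == 0xD800 - (0x10000 >> 10).
            chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
            chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
            break;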
        }
      }
    }
    catch (ArrayIndexOutOfBoundsException e) {
      // charBuf was too small. Estimate the extra space needed from the
      // chars-per-byte ratio seen so far (padded by 10%), then start over
      // with a larger buffer.
      float bytesProcessed = (float)(i - start);
      float charsPerByte = (j / bytesProcessed) * 1.1f;
      float bytesLeft = length - bytesProcessed;
      float targetSize = (float)chars.length + charsPerByte * bytesLeft + 1.0f;
      return utf8ToChars(bytes, start, length,
        CharBuffer.allocate((int)targetSize));
    }
    charBuf.position(j);
    return charBuf;
  }
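
The listing uses TRAILING_BYTES_FOR_UTF8 without showing it. Below is one way such a table could be built; it is a guess at the shape, not the actual table from the patch. The class name Utf8Sketch and the toSurrogates helper are hypothetical, and the helper is there only to check the 0xD7C0 arithmetic used in case 3.

```java
public class Utf8Sketch {
  // Assumed layout: entry [b] is the number of continuation bytes that
  // follow lead byte b. Entries not set below stay 0, which covers ASCII
  // (0x00-0x7F) but also lumps in stray continuation bytes (0x80-0xBF)
  // and the never-valid 0xF8-0xFF. No error checking, as in the listing.
  static final byte[] TRAILING_BYTES_FOR_UTF8 = new byte[256];
  static {
    for (int b = 0xC0; b < 0xE0; b++) TRAILING_BYTES_FOR_UTF8[b] = 1; // 110xxxxx
    for (int b = 0xE0; b < 0xF0; b++) TRAILING_BYTES_FOR_UTF8[b] = 2; // 1110xxxx
    for (int b = 0xF0; b < 0xF8; b++) TRAILING_BYTES_FOR_UTF8[b] = 3; // 11110xxx
  }

  // Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
  // surrogate pair using the same arithmetic as case 3 of the listing.
  static char[] toSurrogates(int codePoint) {
    return new char[] {
      (char)((codePoint >> 10) + 0xD7C0),    // high surrogate
      (char)((codePoint & 0x03FF) + 0xDC00)  // low surrogate
    };
  }
}
```

As a worked check, U+1F600 gives (0x1F600 >> 10) + 0xD7C0 = 0x7D + 0xD7C0 = 0xD83D and (0x1F600 & 0x03FF) + 0xDC00 = 0x200 + 0xDC00 = 0xDE00, which is the pair Character.toChars returns for the same code point.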


