Greets,

I've been experimenting with using the UTF-8 byte count as the VInt at the top of Lucene's string format, as was discussed back in the "Lucene does NOT use UTF-8" thread. Changes were made to IndexInput and IndexOutput per some of Robert Engels's suggestions. Here's the implementation of writeString, which chooses the memory hit of buffering over the performance hit of pre-scanning:

  public void writeString(String s) throws IOException {
    utf8Bytes = s.getBytes("UTF-8");    // utf8Bytes is an instance field, reused across calls
    int length = utf8Bytes.length;
    writeVInt(length);
    writeBytes(utf8Bytes, length);
  }
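(A quick standalone illustration, not part of the patch: the VInt written by this method is now a byte count, which diverges from String.length() as soon as the text leaves ASCII.)

```java
// Demo: UTF-8 byte count vs. Java char count.
import java.io.UnsupportedEncodingException;

public class Utf8LengthDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "lucene";
        String accented = "r\u00e9sum\u00e9";   // "résumé" -- each é is 2 bytes in UTF-8
        System.out.println(ascii.length() + " " + ascii.getBytes("UTF-8").length);
        // prints "6 6"
        System.out.println(accented.length() + " " + accented.getBytes("UTF-8").length);
        // prints "6 8"
    }
}
```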

That, in conjunction with a similar implementation of readString and the UTF-8-clean implementation of readChars I submitted a while back, executes 2-3% slower than the current implementation against my index-1000-wikipedia-articles benchmarker.

I also had a hack at TermBuffer, TermInfosWriter, and StringHelper, in an attempt to convert the prefix length for the Term Dictionary from chars to UTF-8 bytes. (Note that I've left off changing TermVectors for now.)

A major change has to be inflicted on TermBuffer in order for it to work with a bytecount: the char[] "text" buffer must be swapped out for a byte[] "utf8Bytes" buffer. With that change in place, TermBuffer's 'read' method can utilize readBytes instead of readChars:

  public final void read(IndexInput input, FieldInfos fieldInfos)
    throws IOException {
    this.term = null;                           // invalidate cache
    int start = input.readVInt();
    int length = input.readVInt();
    int totalLength = start + length;
    setutf8Length(totalLength);
    input.readBytes(this.utf8Bytes, start, length);
    this.field = fieldInfos.fieldName(input.readVInt());
  }

I'd thought this might speed up scanning through a SegmentTermEnum, since the intermediate step of converting UTF-8 bytes to chars was deferred until toTerm gets called. However, a ScanEnum benchmarker I cooked up which calls next() a whole bunch indicated a wash.

Concerns were raised before about whether it would be necessary to convert all strings to UTF-8 to calculate prefix length for the Term Dictionary. Yes, it's necessary to convert them -- though the same copy can be used as a byte buffer which obviates the need to call writeChars.

  private final void writeTerm(Term term)
       throws IOException {
    byte[] bytes = term.text().getBytes("UTF-8");
    int totalLength = bytes.length;

    int start = StringHelper.bytesDifference(lastBytes, bytes);
    int length = totalLength - start;

    output.writeVInt(start);                   // write shared prefix length
    output.writeVInt(length);                  // write delta length
    for (int i = start; i < totalLength; i++) {
      output.writeByte(bytes[i]);              // write delta UTF-8 bytes
    }

    output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

    lastTerm = term;
    lastBytes = bytes;
  }
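To make the delta encoding concrete, here's a throwaway demo; the bytesDifference logic is inlined so it's self-contained, mirroring the StringHelper addition in the patch:

```java
// Demo of the prefix/delta math used by writeTerm.
import java.io.UnsupportedEncodingException;

public class PrefixDeltaDemo {
    // mirrors StringHelper.bytesDifference from the patch
    static int bytesDifference(byte[] bytes1, byte[] bytes2) {
        int len = Math.min(bytes1.length, bytes2.length);
        for (int i = 0; i < len; i++)
            if (bytes1[i] != bytes2[i]) return i;
        return len;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] last = "apple".getBytes("UTF-8");
        byte[] cur  = "apply".getBytes("UTF-8");
        int start = bytesDifference(last, cur);   // shared prefix length
        int length = cur.length - start;          // delta length
        System.out.println(start + " " + length); // prints "4 1"
    }
}
```

So only the single delta byte 'y' hits the Term Dictionary, plus the two VInts.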


Unfortunately, once the changes to TermBuffer, TermInfosWriter, and StringHelper are applied, index-time execution slows down by about 20%. Perhaps the blame falls on all the calls to getBytes("UTF-8") in TermInfosWriter? Maybe alternative implementations using ByteBuffer, CharsetDecoder, and CharsetEncoder could mitigate the problem?
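For the record, the kind of thing I have in mind is a reusable encoder along these lines -- an untested sketch, with the class and method names of my own invention, not anything in the patch:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: reuse one CharsetEncoder and a growable scratch
// ByteBuffer so writeTerm doesn't allocate a fresh byte[] via
// getBytes("UTF-8") for every term.
public class ReusableUtf8Encoder {
    private final CharsetEncoder encoder =
        Charset.forName("UTF-8").newEncoder();
    private ByteBuffer scratch = ByteBuffer.allocate(128);

    /** Encode s into the scratch buffer; valid bytes are [0, limit()). */
    public ByteBuffer encode(String s) throws CharacterCodingException {
        // worst case: maxBytesPerChar() bytes per Java char
        int maxLen = (int) Math.ceil(encoder.maxBytesPerChar() * s.length());
        if (scratch.capacity() < maxLen)
            scratch = ByteBuffer.allocate(maxLen);   // grow once, then reuse
        scratch.clear();
        encoder.reset();
        encoder.encode(CharBuffer.wrap(s), scratch, true);
        encoder.flush(scratch);
        scratch.flip();
        return scratch;
    }
}
```

Whether dodging the per-term byte[] allocation actually claws back the 20% is an open question, of course.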

A patch for all 5 files against repository revision 329640 is below; to apply it, cd to the trunk directory first.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermBuffer.java src/java/org/apache/lucene/index/TermBuffer.java
--- src_old/java/org/apache/lucene/index/TermBuffer.java	2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermBuffer.java	2005-10-30 13:27:47.000000000 -0800
@@ -20,40 +20,38 @@
import org.apache.lucene.store.IndexInput;
final class TermBuffer implements Cloneable {
-  private static final char[] NO_CHARS = new char[0];
+  private static final byte[] NO_BYTES = new byte[0];
   private String field;
-  private char[] text = NO_CHARS;
-  private int textLength;
+  private byte[] utf8Bytes = NO_BYTES;
+  private int utf8Length;
   private Term term;                            // cached
   public final int compareTo(TermBuffer other) {
     if (field == other.field)              // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareUTF8(utf8Bytes, utf8Length,
+        other.utf8Bytes, other.utf8Length);
     else
       return field.compareTo(other.field);
   }
-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareUTF8(byte[] v1, int len1,
+                                       byte[] v2, int len2) {
     int end = Math.min(len1, len2);
     for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
-      }
+      if (v1[k] != v2[k])
+        return v1[k] - v2[k];
     }
     return len1 - len2;
   }
-  private final void setTextLength(int newLength) {
-    if (text.length < newLength) {
-      char[] newText = new char[newLength];
-      System.arraycopy(text, 0, newText, 0, textLength);
-      text = newText;
+  private final void setutf8Length(int newLength) {
+    if (utf8Bytes.length < newLength) {
+      byte[] newBytes = new byte[newLength];
+      System.arraycopy(utf8Bytes, 0, newBytes, 0, utf8Length);
+      utf8Bytes = newBytes;
     }
-    textLength = newLength;
+    utf8Length = newLength;
   }
   public final void read(IndexInput input, FieldInfos fieldInfos)
@@ -62,28 +60,30 @@
     int start = input.readVInt();
     int length = input.readVInt();
     int totalLength = start + length;
-    setTextLength(totalLength);
-    input.readChars(this.text, start, length);
+    setutf8Length(totalLength);
+    input.readBytes(this.utf8Bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
   }
-  public final void set(Term term) {
-    if (term == null) {
+  public final void set(Term t)  {
+    if (t == null) {
       reset();
       return;
     }
-    // copy text into the buffer
-    setTextLength(term.text().length());
-    term.text().getChars(0, term.text().length(), text, 0);
+    // convert chars into UTF-8 bytes, store in buffer
+    try {
+        utf8Bytes = t.text().getBytes("UTF-8");
+    } catch (java.io.UnsupportedEncodingException e) { }
+    setutf8Length(utf8Bytes.length);
-    this.field = term.field();
-    this.term = term;
+    this.field = t.field();
+    this.term = t;
   }
   public final void set(TermBuffer other) {
-    setTextLength(other.textLength);
-    System.arraycopy(other.text, 0, text, 0, textLength);
+    setutf8Length(other.utf8Length);
+    System.arraycopy(other.utf8Bytes, 0, utf8Bytes, 0, utf8Length);
     this.field = other.field;
     this.term = other.term;
@@ -91,7 +91,7 @@
   public void reset() {
     this.field = null;
-    this.textLength = 0;
+    this.utf8Length = 0;
     this.term = null;
   }
@@ -100,7 +100,9 @@
       return null;
     if (term == null)
-      term = new Term(field, new String(text, 0, textLength), false);
+    try {
+      term = new Term(field, new String(utf8Bytes, 0, utf8Length, "UTF-8"), false);
+    } catch (java.io.UnsupportedEncodingException e) { }
     return term;
   }
@@ -111,8 +113,8 @@
       clone = (TermBuffer)super.clone();
     } catch (CloneNotSupportedException e) {}
-    clone.text = new char[text.length];
-    System.arraycopy(text, 0, clone.text, 0, textLength);
+    clone.utf8Bytes = new byte[utf8Bytes.length];
+    System.arraycopy(utf8Bytes, 0, clone.utf8Bytes, 0, utf8Length);
     return clone;
   }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermInfosWriter.java src/java/org/apache/lucene/index/TermInfosWriter.java
--- src_old/java/org/apache/lucene/index/TermInfosWriter.java	2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermInfosWriter.java	2005-10-30 13:47:43.000000000 -0800
@@ -33,6 +33,8 @@
   private IndexOutput output;
   private Term lastTerm = new Term("", "");
   private TermInfo lastTi = new TermInfo();
+  private byte[] NO_BYTES = new byte[0];
+  private byte[] lastBytes = NO_BYTES;
   private long size = 0;
// TODO: the default values for these two parameters should be settable from
@@ -91,8 +93,10 @@
TermInfo pointers must be positive and greater than all previous.*/
   final void add(Term term, TermInfo ti)
        throws IOException {
-    if (!isIndex && term.compareTo(lastTerm) <= 0)
-      throw new IOException("term out of order");
+    if (!isIndex && term.compareTo(lastTerm) <= 0) {
+      throw new IOException("term out of order" + lastTerm.field +
+      "--" + lastTerm.text + "--" + term.field + "--" + term.text);
+    }
     if (ti.freqPointer < lastTi.freqPointer)
       throw new IOException("freqPointer out of order");
     if (ti.proxPointer < lastTi.proxPointer)
@@ -121,16 +125,22 @@
   private final void writeTerm(Term term)
        throws IOException {
-    int start = StringHelper.stringDifference(lastTerm.text, term.text);
-    int length = term.text.length() - start;
+    byte[] bytes = term.text().getBytes("UTF-8");
+    int totalLength = bytes.length;
+
+    int start = StringHelper.bytesDifference(lastBytes, bytes);
+    int length = totalLength - start;
     output.writeVInt(start);                   // write shared prefix length
     output.writeVInt(length);                  // write delta length
-    output.writeChars(term.text, start, length);  // write delta chars
+    for (int i = start ; i < totalLength; i++) {
+      output.writeByte(bytes[i]);              // write delta UTF-8 bytes
+    }
     output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num
     lastTerm = term;
+    lastBytes = bytes;
   }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexInput.java src/java/org/apache/lucene/store/IndexInput.java
--- src_old/java/org/apache/lucene/store/IndexInput.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexInput.java	2005-10-30 13:27:47.000000000 -0800
@@ -23,7 +23,7 @@
  * @see Directory
  */
public abstract class IndexInput implements Cloneable {
-  private char[] chars;                       // used by readString()
+  private byte[] utf8Bytes;                   // used by readString()
   /** Reads and returns a single byte.
    * @see IndexOutput#writeByte(byte)
@@ -87,32 +87,66 @@
    */
   public String readString() throws IOException {
     int length = readVInt();
-    if (chars == null || length > chars.length)
-      chars = new char[length];
-    readChars(chars, 0, length);
-    return new String(chars, 0, length);
-  }
-
-  /** Reads UTF-8 encoded characters into an array.
+    if (utf8Bytes == null || length > utf8Bytes.length)
+      utf8Bytes = new byte[length];
+    readBytes(utf8Bytes, 0, length);
+    return new String(utf8Bytes, 0, length, "UTF-8");
+  }
+
+  static final byte[] TRAILING_BYTES_FOR_UTF8 = {
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
+    3,3,3,3,3,3,3,3
+  };
+
+  /** Reads UTF-8 encoded characters into an array.
    * @param buffer the array to read characters into
    * @param start the offset in the array to start storing characters
    * @param length the number of characters to read
    * @see IndexOutput#writeChars(String,int,int)
    */
-  public void readChars(char[] buffer, int start, int length)
+  public void readChars(char[] buffer, int start, int length)
        throws IOException {
-    final int end = start + length;
+    final int end = start + length;
     for (int i = start; i < end; i++) {
-      byte b = readByte();
-      if ((b & 0x80) == 0)
-    buffer[i] = (char)(b & 0x7F);
-      else if ((b & 0xE0) != 0xE0) {
-    buffer[i] = (char)(((b & 0x1F) << 6)
-         | (readByte() & 0x3F));
-      } else
-    buffer[i] = (char)(((b & 0x0F) << 12)
-        | ((readByte() & 0x3F) << 6)
-            |  (readByte() & 0x3F));
+      byte b = readByte();
+      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
+        case 0:
+          buffer[i] = (char)(b & 0x7F);
+          break;
+        case 1:
+          buffer[i] = (char)(((b & 0x1F) << 6)
+            | (readByte() & 0x3F));
+          break;
+        case 2:
+          buffer[i] = (char)(((b & 0x0F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          break;
+        case 3:
+          int utf32 = (((b & 0x0F) << 18)
+            | ((readByte() & 0x3F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          buffer[i] = (char)((utf32 >> 10) + 0xD7C0);
+          i++;
+          buffer[i] = (char)((utf32 & 0x03FF) + 0xDC00);
+          break;
+      }
     }
   }
@@ -148,7 +182,7 @@
       clone = (IndexInput)super.clone();
     } catch (CloneNotSupportedException e) {}
-    clone.chars = null;
+    clone.utf8Bytes = null;
     return clone;
   }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexOutput.java src/java/org/apache/lucene/store/IndexOutput.java
--- src_old/java/org/apache/lucene/store/IndexOutput.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexOutput.java	2005-10-30 13:27:47.000000000 -0800
@@ -25,6 +25,8 @@
  */
public abstract class IndexOutput {
+  private byte[] utf8Bytes;                   // used by writeString()
+
   /** Writes a single byte.
    * @see IndexInput#readByte()
    */
@@ -85,9 +87,11 @@
    * @see IndexInput#readString()
    */
   public void writeString(String s) throws IOException {
-    int length = s.length();
+    utf8Bytes = s.getBytes("UTF-8");
+
+    int length = utf8Bytes.length;
     writeVInt(length);
-    writeChars(s, 0, length);
+    writeBytes(utf8Bytes, length);
   }
   /** Writes a sequence of UTF-8 encoded characters from a string.
diff -u -r -X../excludefile src_old/java/org/apache/lucene/util/StringHelper.java src/java/org/apache/lucene/util/StringHelper.java
--- src_old/java/org/apache/lucene/util/StringHelper.java	2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/util/StringHelper.java	2005-10-30 13:48:07.000000000 -0800
@@ -44,7 +44,26 @@
     return len;
   }
-
+  /**
+   * Compares two byte[] arrays, element by element, and returns the
+   * number of elements common to both arrays.
+   *
+   * @param bytes1 The first byte[] to compare
+   * @param bytes2 The second byte[] to compare
+   * @return The number of common elements.
+   */
+  public static final int bytesDifference(byte[] bytes1, byte[] bytes2) {
+    int len1 = bytes1.length;
+    int len2 = bytes2.length;
+    int len = len1 < len2 ? len1 : len2;
+    for (int i = 0; i < len; i++) {
+      if (bytes1[i] != bytes2[i]) {
+          return i;
+      }
+    }
+    return len;
+  }
+
   private StringHelper() {
   }
}

