Optimizing SegmentTermEnum (and friends)

Dmitry Serebrennikov Tue, 25 Feb 2003 11:36:50 -0800

Greetings,

I've been running Lucene (inside of our application) in OptimizeIt to see if I can improve garbage generation and performance metrics of our app. Let me say right a way that Lucene is great! :) The more experience I have with it the more I find how well it performs. However, perhaps there are a few places still left in Lucene that could use further optimization. Particularly, I've been looking at SegmentTermEnum, since our app does a lot of what I call "term queries" - queries that result in finding matching terms rather than matching documents.

It seems that there are at least two other threads right now on the list that are discussing similar types of queries, but I'm not too familiar with those particular APIs. What we do is instantiate TermEnums and than iterate through them in various ways. For example, the following code fragment is what I used for testing. It is similar to some of the queries that we run. I also think that any type of "starts-with" wildcard query would also experience the same issues.

   private static void run2(IndexReader ir) throws Exception {
       for (int i=0; i<10; i++) {
           for (char c1='a'; c1 < 'z'; c1++) {
               for (char c2='a'; c2 < 'z'; c2++) {
                  Term starter = new Term(FIELD, "" + c1 + c2 );
                  TermEnum terms = ir.terms(starter);
                  if (terms.isValid()) {
                      Term target = terms.term();
                      System.out.println(target);
                  }
               }
           }
       }
   }

The main source of garbage when running this code appears to be in SegmentTermEnum, which insists on creating Term instances (with the Strings and char[]s that go into them) for each term that it looks at. The typical pattern is that it picks a place in the term enum using the term index that is near the term I am requesting, seek()s to it, and then proceeds to retrieve terms sequentially until it gets to the term that I need.

Now, to the question / proposal part: 1) Since I do not need the intermediate terms, it makes sence to try to have a method that skips to the right term without creating the intermediate Term objects. I have done a version of this yesterday and ended up seeing a factor of 2 performance encrease and a factor of 2 garbage reduction. The patch adds the following method to Term.java: final int compareTo(String otherField, char[] otherText, int start, int len) And changes SegmentTermEnum.java to delay creation of Term object until call to term(). Full diff is attached. Any comments are welcome, especially if I've missed something.

2) So far, so good, but perhaps we can do better? I am looking for other suggestions to implement around this functionality. For example, the above change helps with "starts-with" queries, but will do nothing for "contains" types of queries. Perhaps a better way to do this is to provide a way for users to supply a TermPredicate that would be evaluated against the term data without having to construct a Term object for that. The skipTo(TermPredicate) would then stop at the next matching term or the end of the enum.

3) I found a piece of code in TermInfosReader.java that uses a field SegmentTermEnum.prev to try to optimize seeks. It looks like this code was put in after the original SegmentTermEnum was finished. I can't find any record of this change in Jakarta's CVS, so probably it was done prior to moving to Jakarta. Does anyone remember why this is here? Does it actually serve a useful purpose? It seems that the condition this code is testing for would not really occur. Perhaps I'm missing something. Here's the code fragment that uses the .prev field:

/** Returns the TermInfo for a Term in the set, or null. */ final synchronized TermInfo get(Term term) throws IOException { if (size == 0) return null; // optimize sequential access: first try scanning cached enum w/o seeking if (enum.term() != null // term is at or past current && ((enum.prev != null && term.compareTo(enum.prev) > 0) || term.compareTo(enum.term()) >= 0)) { int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1; if (indexTerms.length == enumOffset // but before end of block || term.compareTo(indexTerms[enumOffset]) < 0) return scanEnum(term); // no need to seek } // random-access: must seek seekEnum(getIndexOffset(term)); return scanEnum(term); }

Index: Term.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/Term.java,v
retrieving revision 1.4
diff -B -u -w -r1.4 Term.java
--- Term.java   13 Jan 2003 23:50:33 -0000      1.4
+++ Term.java   25 Feb 2003 19:16:25 -0000
@@ -110,6 +110,22 @@
       return field.compareTo(other.field);
   }
 
+  /** Compares this term to a term represented as a field (interned) and 
+   *  a fragment of a character array containing the term's text.
+   */
+  final int compareTo(String otherField, char[] otherText, int start, int len) {
+    if (field == otherField) {
+        int end = Math.min(text.length(), len);
+        for (int i=0; i < end; i++) {
+            int res = text.charAt(i) - otherText[start + i];
+            if (res != 0) return res;
+        }
+        return text.length() - len;
+    } else {
+        return field.compareTo(otherField);
+    }
+  }
+  
   /** Resets the field and text of a Term. */
   final void set(String fld, String txt) {
     field = fld;
Index: SegmentTermEnum.java
===================================================================
RCS file: 
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentTermEnum.java,v
retrieving revision 1.2
diff -B -u -w -r1.2 SegmentTermEnum.java
--- SegmentTermEnum.java        11 Oct 2001 15:14:14 -0000      1.2
+++ SegmentTermEnum.java        25 Feb 2003 19:19:28 -0000
@@ -63,7 +63,6 @@
   int size;
   int position = -1;
 
-  private Term term = new Term("", "");
   private TermInfo termInfo = new TermInfo();
 
   boolean isIndex = false;
@@ -71,6 +70,9 @@
   Term prev;
 
   private char[] buffer = {};
+  private String currentTermField;
+  private int currentTermLength = -1;
+  private Term cachedTerm; // only created when asked for
 
   SegmentTermEnum(InputStream i, FieldInfos fis, boolean isi)
        throws IOException {
@@ -88,7 +90,10 @@
 
     clone.input = (InputStream)input.clone();
     clone.termInfo = new TermInfo(termInfo);
-    if (term != null) clone.growBuffer(term.text.length());
+    clone.currentTermField = currentTermField;
+    clone.currentTermLength = currentTermLength;
+    if (buffer != null)
+        clone.buffer = (char[]) buffer.clone();
 
     return clone;
   }
@@ -97,21 +102,56 @@
        throws IOException {
     input.seek(pointer);
     position = p;
-    term = t;
+    
     prev = null;
     termInfo.set(ti);
-    growBuffer(term.text.length());              // copy term text into buffer
+    currentTermField = t.field;
+    currentTermLength = t.text.length();
+    cachedTerm = null;
+    buffer = t.text.toCharArray();
   }
 
+  
+  /** Skips terms to the first beyond the current whose value is
+   * greater or equal to <i>target</i>. <p>Returns true iff there is such
+   * an entry.  <p>Behaves as if written: <pre>
+   *   public boolean skipTo(Term target) {
+   *     do {
+   *       if (!next())
+   *        return false;
+   *     } while (target > term());
+   *     return true;
+   *   }
+   * </pre>
+   * Some implementations are considerably more efficient than that.
+   */
+  public final boolean skipTo(Term target) throws IOException {
+    prev = null;
+    
+    do {
+        if (! next()) return false;
+    } while (target.compareTo(currentTermField, buffer, 0, currentTermLength) > 0);
+    
+    return true;
+  }
+
+  
   /** Increments the enumeration to the next element.  True if one exists.*/
   public final boolean next() throws IOException {
+    cachedTerm = null;
+  
     if (position++ >= size-1) {
-      term = null;
+      currentTermLength = -1;
       return false;
     }
 
-    prev = term;
-    term = readTerm();
+    int start = input.readVInt();
+    int length = input.readVInt();
+    currentTermLength = start + length;
+    growBuffer(currentTermLength);
+
+    input.readChars(buffer, start, length);
+    currentTermField = fieldInfos.fieldName(input.readVInt());
 
     termInfo.docFreq = input.readVInt();         // read doc freq
     termInfo.freqPointer += input.readVLong();   // read freq pointer
@@ -123,28 +163,26 @@
     return true;
   }
 
-  private final Term readTerm() throws IOException {
-    int start = input.readVInt();
-    int length = input.readVInt();
-    int totalLength = start + length;
-    if (buffer.length < totalLength)
-      growBuffer(totalLength);
-    
-    input.readChars(buffer, start, length);
-    return new Term(fieldInfos.fieldName(input.readVInt()),
-                   new String(buffer, 0, totalLength), false);
-  }
 
   private final void growBuffer(int length) {
-    buffer = new char[length];
-    for (int i = 0; i < term.text.length(); i++)  // copy contents
-      buffer[i] = term.text.charAt(i);
+    if (buffer == null || buffer.length < length) {
+        char tmp[] = new char[length];
+        System.arraycopy(buffer, 0, tmp, 0, buffer.length);
+        buffer = tmp;
+    }
   }
 
   /** Returns the current Term in the enumeration.
     Initially invalid, valid after next() called for the first time.*/
   public final Term term() {
-    return term;
+    if (currentTermLength >= 0) {
+        if (cachedTerm == null) {
+            cachedTerm = new Term(currentTermField, new String(buffer, 0, 
currentTermLength), false);
+        }
+        return cachedTerm;
+    } else {
+        return null;
+    }
   }
 
   /** Returns the current TermInfo in the enumeration.
Index: TermInfosReader.java
===================================================================
RCS file: 
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
retrieving revision 1.2
diff -B -w -u -r1.2 TermInfosReader.java
--- TermInfosReader.java        29 Jan 2003 17:18:54 -0000      1.2
+++ TermInfosReader.java        25 Feb 2003 19:21:11 -0000
@@ -162,7 +162,7 @@
   
   /** Scans within block for matching term. */
   private final TermInfo scanEnum(Term term) throws IOException {
-    while (term.compareTo(enum.term()) > 0 && enum.next()) {}
+    if (term.compareTo(enum.term()) > 0) enum.skipTo(term);
     if (enum.term() != null && term.compareTo(enum.term()) == 0)
       return enum.termInfo();
     else

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Optimizing SegmentTermEnum (and friends)

Reply via email to