Refactored FuzzyTermEnum

Bob Carpenter Tue, 13 Jun 2006 12:16:05 -0700


I refactored the org.apache.lucene.search.FuzzyTermEnum
edit distance implementation.  It now only uses a single
pair of arrays, and those never get resized.  That required
turning the order of text/target around in the loops.  You'll
see that with the pair of arrays method, they get re-used
hand-over-hand, and are assigned to local variables in the
tight loops.


I removed the calculation of min distance and replaced
it with a boolean -- the min wasn't needed, only the test vs.
the max.  I also flipped some variables around so there's
one less addition in the very inner loop and the arrays are
now looping in the ordinary way (starting at 0 with a < comparison).
I also commented out the redundant definition of the public close()
[which just called super.close() and had none of its own doc.]
I also just compute the max distance each time rather than
fiddling with an array -- it's just a little arithmetic done once
per term, but that could be put back.

I also rewrote min(int,int,int) to get rid of intermediate
assignments.  Is there a lib for this kind of thing?

An intermediate refactoring that does the hand-over-hand

with the existing array and resizing strategy is inFuzzyTermEnum.intermed.java.


I ran the unit tests as follows on both versions (my hat's off to the
build.xml author(s) -- this all just worked out of the box and was
really easy to follow the first through):

C:\java\lucene-2.0.0>ant -Djunit.includes="" -Dtestcase=TestFuzzyQuery test
Buildfile: build.xml
javacc-uptodate-check:
javacc-notice:
init:
common.compile-core:

[javac] Compiling 1 source file toC:\java\lucene-2.0.0\build\classes\java

compile-core:
compile-demo:
common.compile-test:
compile-test:
test:
   [junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
   [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec
BUILD SUCCESSFUL
Total time: 2 seconds

Does anyone have regression/performance test harnesses?
The unit tests were pretty minimal (which is a good thing!).
It'd be nice to test the min impl (ternary vs. if/then)
and the assumption about not allocating an
array of max distances.  All told, the refactored version
should be a modest speed improvement, mainly from
unfolding the arrays so they're local one-dimensional refs.

I don't know what the protocol is for one-off contributions.
I'm happy with the Apache license, so that shouldn't
be a problem.  I also don't know whether you use tabs
or spaces -- I untabified the final version and used your
two-space format in emacs.

- Bob Carpenter

package org.apache.lucene.search;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

import java.io.IOException;

/** Subclass of FilteredTermEnum for enumerating all terms that are similiar
 * to the specified filter term.
 *
 * <p>Term enumerations are always ordered by Term.compareTo().  Each term in
 * the enumeration is greater than all that precede it.
 */
public final class FuzzyTermEnum extends FilteredTermEnum {

  /* This should be somewhere around the average long word.
   * If it is longer, we waste time and space. If it is shorter, we waste a
   * little bit of time growing the array as we encounter longer words.
   */
  private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;

  /* Allows us save time required to create a new array
   * everytime similarity is called.  These are slices that
   * will be reused during dynamic programming hand-over-hand
   * style. 
   */
  private final int[] d0;
  private final int[] d1;    

  private float similarity;
  private boolean endEnum = false;

  private Term searchTerm = null;
  private final String field;
  private final String text;
  private final String prefix;

  private final float minimumSimilarity;
  private final float scale_factor;

  /**
   * Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader
   * @param term
   * @throws IOException
   * @see #FuzzyTermEnum(IndexReader, Term, float, int)
   */
  public FuzzyTermEnum(IndexReader reader, Term term) throws IOException {
    this(reader, term, FuzzyQuery.defaultMinSimilarity, 
FuzzyQuery.defaultPrefixLength);
  }
    
  /**
   * Creates a FuzzyTermEnum with an empty prefix.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader
   * @param term
   * @param minSimilarity
   * @throws IOException
   * @see #FuzzyTermEnum(IndexReader, Term, float, int)
   */
  public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) 
throws IOException {
    this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength);
  }
    
  /**
   * Constructor for enumeration of all terms from specified 
<code>reader</code> which share a prefix of
   * length <code>prefixLength</code> with <code>term</code> and which have a 
fuzzy similarity &gt;
   * <code>minSimilarity</code>.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader Delivers terms.
   * @param term Pattern term.
   * @param minSimilarity Minimum required similarity for terms from the 
reader. Default value is 0.5f.
   * @param prefixLength Length of required common prefix. Default value is 0.
   * @throws IOException
   */
  public FuzzyTermEnum(IndexReader reader, Term term, final float 
minSimilarity, final int prefixLength) throws IOException {
    super();
    
    if (minSimilarity >= 1.0f)
      throw new IllegalArgumentException("minimumSimilarity cannot be greater 
than or equal to 1");
    else if (minSimilarity < 0.0f)
      throw new IllegalArgumentException("minimumSimilarity cannot be less than 
0");
    if(prefixLength < 0)
      throw new IllegalArgumentException("prefixLength cannot be less than 0");

    this.minimumSimilarity = minSimilarity;
    this.scale_factor = 1.0f / (1.0f - minimumSimilarity);
    this.searchTerm = term;
    this.field = searchTerm.field();

    //The prefix could be longer than the word.
    //It's kind of silly though.  It means we must match the entire word.
    final int fullSearchTermLength = searchTerm.text().length();
    final int realPrefixLength = prefixLength > fullSearchTermLength ? 
fullSearchTermLength : prefixLength;

    this.text = searchTerm.text().substring(realPrefixLength);
    this.prefix = searchTerm.text().substring(0, realPrefixLength);

    this.d0 = new int[this.text.length()+1];
    this.d1 = new int[this.d0.length];

    setEnum(reader.terms(new Term(searchTerm.field(), prefix)));
  }

  /**
   * The termCompare method in FuzzyTermEnum uses Levenshtein distance to 
   * calculate the distance between the given term and the comparing term. 
   */
  protected final boolean termCompare(Term term) {
    if (field == term.field() && term.text().startsWith(prefix)) {
        final String target = term.text().substring(prefix.length());
        this.similarity = similarity(target);
        return (similarity > minimumSimilarity);
    }
    endEnum = true;
    return false;
  }
  
  public final float difference() {
    return (float)((similarity - minimumSimilarity) * scale_factor);
  }
  
  public final boolean endEnum() {
    return endEnum;
  }
  
  /******************************
   * Compute Levenshtein distance
   ******************************/
  
  /**
   * Finds and returns the smallest of three integers 
   */
  private static final int min(int a, int b, int c) {
      // removed assignments to use double ternary
      return (a < b)
          ? ((a < c) ? a : c)
          : ((b < c) ? b: c);

      // alt form is:
      // if (a < b) { if (a < c) return a; else return c; }
      // if (b < c) return b; else return c;
  }

  /**
   * <p>Similarity returns a number that is 1.0f or less (including negative 
numbers)
   * based on how similar the Term is compared to a target term.  It returns
   * exactly 0.0f when
   * <pre>
   *    editDistance &lt; maximumEditDistance</pre>
   * Otherwise it returns:
   * <pre>
   *    1 - (editDistance / length)</pre>
   * where length is the length of the shortest term (text or target) including 
a
   * prefix that are identical and editDistance is the Levenshtein distance for
   * the two words.</p>
   *
   * <p>Embedded within this algorithm is a fail-fast Levenshtein distance
   * algorithm.  The fail-fast algorithm differs from the standard Levenshtein
   * distance algorithm in that it is aborted if it is discovered that the
   * mimimum distance between the words is greater than some threshold.
   *
   * <p>To calculate the maximum distance threshold we use the following 
formula:
   * <pre>
   *     (1 - minimumSimilarity) * length</pre>
   * where length is the shortest term including any prefix that is not part of 
the
   * similarity comparision.  This formula was derived by solving for what 
maximum value
   * of distance returns false for the following statements:
   * <pre>
   *   similarity = 1 - ((float)distance / (float) (prefixLength + 
Math.min(textlen, targetlen)));
   *   return (similarity > minimumSimilarity);</pre>
   * where distance is the Levenshtein distance for the two words.
   * </p>
   * <p>Levenshtein distance (also known as edit distance) is a measure of 
similiarity
   * between two strings where the distance is measured as the number of 
character
   * deletions, insertions or substitutions required to transform one string to
   * the other string.
   * @param target the target word or phrase
   * @return the similarity,  0.0 or less indicates that it matches less than 
the required
   * threshold and 1.0 indicates that the text and target are identical
   */
  private synchronized final float similarity(final String target) {
    final int m = target.length();
    final int n = text.length();
    if (n == 0)  {
      //we don't have anything to compare.  That means if we just add
      //the letters for m we get the new word
      return prefix.length() == 0 ? 0.0f : 1.0f - ((float) m / prefix.length());
    }
    if (m == 0) {
      return prefix.length() == 0 ? 0.0f : 1.0f - ((float) n / prefix.length());
    }

    final int maxDistance = calculateMaxDistance(m);

    if (maxDistance < Math.abs(m-n)) {
      //just adding the characters of m to n or vice-versa results in
      //too many edits
      //for example "pre" length is 3 and "prefixes" length is 8.  We can see 
that
      //given this optimal circumstance, the edit distance cannot be less than 
5.
      //which is 8-3 or more precisesly Math.abs(3-8).
      //if our maximum edit distance is 4, then we can discard this word
      //without looking at it.
      return 0.0f;
    }

    int[] dLast = d0;  // set locals for efficiency
    int[] dCurrent = d1; 
    for (int j = 0; j <= n; j++) dCurrent[j] = j;

    for (int i = 0; i < m; ) {
      final char s_i = target.charAt(i);
      int[] dTemp = dLast;
      dLast = dCurrent;    // previously: d[i-i]
      dCurrent = dTemp;    // previously: d[i]
      boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] is 
too large
      for (int j = 0; j < n; j++) {
        dCurrent[j+1] = (s_i == text.charAt(j))
            ? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j])
            : min(dLast[j+1], dCurrent[j], dLast[j])+1;
        if (prune && dCurrent[j+1] <= maxDistance) 
            prune = false;
      }

      // (prune==false) iff (dCurrent[j] < maxDistance) for some j
      if (prune) {
          return 0.0f;
      }
    }
    
    // this will return less than 0.0 when the edit distance is
    // greater than the number of characters in the shorter word.
    // but this was the formula that was previously used in FuzzyTermEnum,
    // so it has not been changed (even though minimumSimilarity must be
    // greater than 0.0)
    return 1.0F - dCurrent[n]/(float)(prefix.length() + Math.min(n,m));
  }

  private int calculateMaxDistance(int m) {
    return (int) ((1-minimumSimilarity) * (Math.min(text.length(), m) + 
prefix.length()));
  }

    /* This is redundant
  public void close() throws IOException {
    super.close();  //call super.close() and let the garbage collector do its 
work.
  }
    */
  

}

package org.apache.lucene.search;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

import java.io.IOException;

/** Subclass of FilteredTermEnum for enumerating all terms that are similiar
 * to the specified filter term.
 *
 * <p>Term enumerations are always ordered by Term.compareTo().  Each term in
 * the enumeration is greater than all that precede it.
 */
public final class FuzzyTermEnum extends FilteredTermEnum {

  /* This should be somewhere around the average long word.
   * If it is longer, we waste time and space. If it is shorter, we waste a
   * little bit of time growing the array as we encounter longer words.
   */
  private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;

  /* Allows us save time required to create a new array
   * everytime similarity is called.  These are slices that
   * will be reused during dynamic programming hand-over-hand
   * style.  They get resized, if necessary, by growDistanceArrays(int).
   */
  private int[] d0;
  private int[] d1;    

  private float similarity;
  private boolean endEnum = false;

  private Term searchTerm = null;
  private final String field;
  private final String text;
  private final String prefix;

  private final float minimumSimilarity;
  private final float scale_factor;

  /**
   * Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader
   * @param term
   * @throws IOException
   * @see #FuzzyTermEnum(IndexReader, Term, float, int)
   */
  public FuzzyTermEnum(IndexReader reader, Term term) throws IOException {
    this(reader, term, FuzzyQuery.defaultMinSimilarity, 
FuzzyQuery.defaultPrefixLength);
  }
    
  /**
   * Creates a FuzzyTermEnum with an empty prefix.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader
   * @param term
   * @param minSimilarity
   * @throws IOException
   * @see #FuzzyTermEnum(IndexReader, Term, float, int)
   */
  public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) 
throws IOException {
    this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength);
  }
    
  /**
   * Constructor for enumeration of all terms from specified 
<code>reader</code> which share a prefix of
   * length <code>prefixLength</code> with <code>term</code> and which have a 
fuzzy similarity &gt;
   * <code>minSimilarity</code>.
   * <p>
   * After calling the constructor the enumeration is already pointing to the 
first 
   * valid term if such a term exists. 
   * 
   * @param reader Delivers terms.
   * @param term Pattern term.
   * @param minSimilarity Minimum required similarity for terms from the 
reader. Default value is 0.5f.
   * @param prefixLength Length of required common prefix. Default value is 0.
   * @throws IOException
   */
  public FuzzyTermEnum(IndexReader reader, Term term, final float 
minSimilarity, final int prefixLength) throws IOException {
    super();
    
    if (minSimilarity >= 1.0f)
      throw new IllegalArgumentException("minimumSimilarity cannot be greater 
than or equal to 1");
    else if (minSimilarity < 0.0f)
      throw new IllegalArgumentException("minimumSimilarity cannot be less than 
0");
    if(prefixLength < 0)
      throw new IllegalArgumentException("prefixLength cannot be less than 0");

    this.minimumSimilarity = minSimilarity;
    this.scale_factor = 1.0f / (1.0f - minimumSimilarity);
    this.searchTerm = term;
    this.field = searchTerm.field();

    //The prefix could be longer than the word.
    //It's kind of silly though.  It means we must match the entire word.
    final int fullSearchTermLength = searchTerm.text().length();
    final int realPrefixLength = prefixLength > fullSearchTermLength ? 
fullSearchTermLength : prefixLength;

    this.text = searchTerm.text().substring(realPrefixLength);
    this.prefix = searchTerm.text().substring(0, realPrefixLength);

    growDistanceArrays(TYPICAL_LONGEST_WORD_IN_INDEX);

    setEnum(reader.terms(new Term(searchTerm.field(), prefix)));
  }

  /**
   * The termCompare method in FuzzyTermEnum uses Levenshtein distance to 
   * calculate the distance between the given term and the comparing term. 
   */
  protected final boolean termCompare(Term term) {
    if (field == term.field() && term.text().startsWith(prefix)) {
        final String target = term.text().substring(prefix.length());
        this.similarity = similarity(target);
        return (similarity > minimumSimilarity);
    }
    endEnum = true;
    return false;
  }
  
  public final float difference() {
    return (float)((similarity - minimumSimilarity) * scale_factor);
  }
  
  public final boolean endEnum() {
    return endEnum;
  }
  
  /******************************
   * Compute Levenshtein distance
   ******************************/
  
  /**
   * Finds and returns the smallest of three integers 
   */
  private static final int min(int a, int b, int c) {
      // removed assignments to use double ternary
      return (a < b)
          ? ((a < c) ? a : c)
          : ((b < c) ? b: c);

      // alt form is:
      // if (a < b) { if (a < c) return a; else return c; }
      // if (b < c) return b; else return c;
  }

  /**
   * <p>Similarity returns a number that is 1.0f or less (including negative 
numbers)
   * based on how similar the Term is compared to a target term.  It returns
   * exactly 0.0f when
   * <pre>
   *    editDistance &lt; maximumEditDistance</pre>
   * Otherwise it returns:
   * <pre>
   *    1 - (editDistance / length)</pre>
   * where length is the length of the shortest term (text or target) including 
a
   * prefix that are identical and editDistance is the Levenshtein distance for
   * the two words.</p>
   *
   * <p>Embedded within this algorithm is a fail-fast Levenshtein distance
   * algorithm.  The fail-fast algorithm differs from the standard Levenshtein
   * distance algorithm in that it is aborted if it is discovered that the
   * mimimum distance between the words is greater than some threshold.
   *
   * <p>To calculate the maximum distance threshold we use the following 
formula:
   * <pre>
   *     (1 - minimumSimilarity) * length</pre>
   * where length is the shortest term including any prefix that is not part of 
the
   * similarity comparision.  This formula was derived by solving for what 
maximum value
   * of distance returns false for the following statements:
   * <pre>
   *   similarity = 1 - ((float)distance / (float) (prefixLength + 
Math.min(textlen, targetlen)));
   *   return (similarity > minimumSimilarity);</pre>
   * where distance is the Levenshtein distance for the two words.
   * </p>
   * <p>Levenshtein distance (also known as edit distance) is a measure of 
similiarity
   * between two strings where the distance is measured as the number of 
character
   * deletions, insertions or substitutions required to transform one string to
   * the other string.
   * @param target the target word or phrase
   * @return the similarity,  0.0 or less indicates that it matches less than 
the required
   * threshold and 1.0 indicates that the text and target are identical
   */
  private synchronized final float similarity(final String target) {
    final int m = target.length();
    final int n = text.length();
    if (n == 0)  {
      //we don't have anything to compare.  That means if we just add
      //the letters for m we get the new word
      return prefix.length() == 0 ? 0.0f : 1.0f - ((float) m / prefix.length());
    }
    if (m == 0) {
      return prefix.length() == 0 ? 0.0f : 1.0f - ((float) n / prefix.length());
    }

    final int maxDistance = calculateMaxDistance(m);

    if (maxDistance < Math.abs(m-n)) {
      //just adding the characters of m to n or vice-versa results in
      //too many edits
      //for example "pre" length is 3 and "prefixes" length is 8.  We can see 
that
      //given this optimal circumstance, the edit distance cannot be less than 
5.
      //which is 8-3 or more precisesly Math.abs(3-8).
      //if our maximum edit distance is 4, then we can discard this word
      //without looking at it.
      return 0.0f;
    }

    //let's make sure we have enough room in our array to do the distance 
calculations.
    if (d0.length <= m) {
        growDistanceArrays(m);
    }

    int[] dLast = d0;  // set local vars for efficiency ~ the old d[i-1]
    int[] dCurrent = d1;  //                            ~ the old d[i]
    for (int j = 0; j <= m; j++) dCurrent[j] = j;

    for (int i = 0; i < n; ) {
        final char s_i = text.charAt(i);
        int[] dTemp = dLast;
        dLast = dCurrent;    // previously: d[i-i]
        dCurrent = dTemp;    // previously: d[i]
        boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] 
is too large
        for (int j = 0; j < m; j++) {
            dCurrent[j+1] = (s_i == target.charAt(j))
                ? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j])
                : min(dLast[j+1], dCurrent[j], dLast[j])+1;
            if (prune && dCurrent[j+1] <= maxDistance) 
                prune = false;
        }

        // (prune==false) iff (dCurrent[j] < maxDistance) for some j
        if (prune) {
            return 0.0f;
        }
    }

    // this will return less than 0.0 when the edit distance is
    // greater than the number of characters in the shorter word.
    // but this was the formula that was previously used in FuzzyTermEnum,
    // so it has not been changed (even though minimumSimilarity must be
    // greater than 0.0)
    return 1.0F - dCurrent[m]/(float)(prefix.length() + Math.min(n,m));
  }


  /**
   * Grow the second dimension of the array slices, so that we can
   * calculate the Levenshtein difference.
   */
  private void growDistanceArrays(int m) {
      d0 = new int[m+1];
      d1 = new int[m+1];
  }

  private int calculateMaxDistance(int m) {
    return (int) ((1-minimumSimilarity) * (Math.min(text.length(), m) + 
prefix.length()));
  }

    /* This is redundant
  public void close() throws IOException {
    super.close();  //call super.close() and let the garbage collector do its 
work.
  }
    */
  

}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Refactored FuzzyTermEnum

Reply via email to