DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=31882>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=31882

[PATCH] FuzzyTermEnum optimization and refactor





------- Additional Comments From [EMAIL PROTECTED]  2004-10-26 22:58 -------
The "synchronized" was added to the similarity() because it uses the d[][]
across calls to it.  If similarity() was called on the SAME FuzzyTermEnum object
by two different threads, the results could not be guaranteed.  How lucene uses
FuzzyTermEnum, this will never happen.  Currently, Lucene creates it and let's
it be garbage collected within a single method.  This is more defensive
programming than anything else.

The TYPICAL_LONGEST_WORD_IN_INDEX is used in two different context.  The first
context is to initializing the d[][] to something smarter than [x][1].  It
should save very little time as d[][] will quickly grow to the appropriate size.
 The second context is to create a lookup table for the max distance, so that it
does not have to recalculate what the max distance is for every word.  The max
distance is a constant for every word of the same length.  I also only saw
marginal gain from this change.  

I went back to estimate the acutal amount of time saved from this change in
isolation.  On a modern machine, it is less than 200ms for 10 million words (see
test below).  

public class Scratch
{
  private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
  private static final float[] TYPICAL_WORD_LENGTH_DISTRIBUTION = {
   0, 0.01f, 0.015f, 0.025f, 0.05f, 0.10f, 0.125f, 0.15f, 0.155f, 0.12f, 0.10f,
0.07f, 0.04f, 0.015f, 0.010f, 0.005f, 0.005f, 0.005f
  };

  private final int[] maxDistances = new int[TYPICAL_LONGEST_WORD_IN_INDEX];

  public Scratch()
  {
    for (int i = 0; i < maxDistances.length; i++)
    {
      maxDistances[i] = calculateMaxDistance(i);
    }
  }

  private final int getMaxDistance(int m){
    return (m < maxDistances.length) ? maxDistances[m] : calculateMaxDistance(m);
  }

  private int calculateMaxDistance(int m){
    return (int) ((1-0.5) * (Math.min(7, m) + 0));
  }

  public static void main(String[] args)
  {    
    long start = System.currentTimeMillis();
    Scratch s = new Scratch();
    final int totalNumberOfRuns = 10000000;
    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++)
    {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      //System.out.println(runsForASingleLength);
      for (int j=0; j < runsForASingleLength; j++) {
        s.getMaxDistance(i);
      }
    }

    long total = System.currentTimeMillis() - start;

    System.out.println("Time to lookup the records (in ms): " + total);

    long start2 = System.currentTimeMillis();

    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++)
    {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      for (int j=0; j < runsForASingleLength; j++) {
        s.calculateMaxDistance(i);
      }
    }

     long total2 = System.currentTimeMillis() - start2;

    System.out.println("Time to recalculate the records (in ms): " + total2);
  }

}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to