DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://issues.apache.org/bugzilla/show_bug.cgi?id=31882>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://issues.apache.org/bugzilla/show_bug.cgi?id=31882 [PATCH] FuzzyTermEnum optimization and refactor ------- Additional Comments From [EMAIL PROTECTED] 2004-10-26 22:58 ------- The "synchronized" was added to the similarity() because it uses the d[][] across calls to it. If similarity() was called on the SAME FuzzyTermEnum object by two different threads, the results could not be guaranteed. How lucene uses FuzzyTermEnum, this will never happen. Currently, Lucene creates it and let's it be garbage collected within a single method. This is more defensive programming than anything else. The TYPICAL_LONGEST_WORD_IN_INDEX is used in two different context. The first context is to initializing the d[][] to something smarter than [x][1]. It should save very little time as d[][] will quickly grow to the appropriate size. The second context is to create a lookup table for the max distance, so that it does not have to recalculate what the max distance is for every word. The max distance is a constant for every word of the same length. I also only saw marginal gain from this change. I went back to estimate the acutal amount of time saved from this change in isolation. On a modern machine, it is less than 200ms for 10 million words (see test below). public class Scratch { private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19; private static final float[] TYPICAL_WORD_LENGTH_DISTRIBUTION = { 0, 0.01f, 0.015f, 0.025f, 0.05f, 0.10f, 0.125f, 0.15f, 0.155f, 0.12f, 0.10f, 0.07f, 0.04f, 0.015f, 0.010f, 0.005f, 0.005f, 0.005f }; private final int[] maxDistances = new int[TYPICAL_LONGEST_WORD_IN_INDEX]; public Scratch() { for (int i = 0; i < maxDistances.length; i++) { maxDistances[i] = calculateMaxDistance(i); } } private final int getMaxDistance(int m){ return (m < maxDistances.length) ? maxDistances[m] : calculateMaxDistance(m); } private int calculateMaxDistance(int m){ return (int) ((1-0.5) * (Math.min(7, m) + 0)); } public static void main(String[] args) { long start = System.currentTimeMillis(); Scratch s = new Scratch(); final int totalNumberOfRuns = 10000000; for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) { float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i]; final int runsForASingleLength = (int) (v * totalNumberOfRuns); //System.out.println(runsForASingleLength); for (int j=0; j < runsForASingleLength; j++) { s.getMaxDistance(i); } } long total = System.currentTimeMillis() - start; System.out.println("Time to lookup the records (in ms): " + total); long start2 = System.currentTimeMillis(); for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) { float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i]; final int runsForASingleLength = (int) (v * totalNumberOfRuns); for (int j=0; j < runsForASingleLength; j++) { s.calculateMaxDistance(i); } } long total2 = System.currentTimeMillis() - start2; System.out.println("Time to recalculate the records (in ms): " + total2); } } --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]