Sorry Erik, I did not really get the point about Similarity: what should be done there, and why?
The way I see the current implementation of the fuzzy query, FuzzyTermEnum has two responsibilities:

1. To filter out terms that are not similar enough to the search term (FUZZY_THRESHOLD)
2. To actually calculate how big the difference is (difference())

This calculated difference is introduced into the big picture via setBoost() in MultiTermQuery.rewrite(). The way this difference is used in the similarity calculation could probably be placed somewhere in Similarity, using some method other than setBoost(), but the actual calculation of the distance is one level lower than this. At the moment I have no better idea than to leave this as it is (via setBoost()).

What Robert suggested sounds pretty reasonable to me. With it we would be able to:

- Have a different distance function for each field
- Give the user a clean way to implement a new distance function and then simply map it to the field name
- In the same go, I would suggest making FUZZY_THRESHOLD / SCALE_FACTOR changeable at the field level; right now they are fixed.
- Probably introduce a parameter that controls the “required prefix” length, to trade speed against quality. (I must stare at the current code a bit longer to understand how this could be done.)
- Break nothing in the current implementation, thanks to clever defaults

In the longer term I plan to experiment with some other approaches to speed up the fuzzy query (e.g. creating a bigram index of all tokens in the index and then searching this bigram index for candidate tokens that should be compared: an “Inverted Index inside the Index”). But it is too early for this.

Anyhow, I will do some experiments and try to learn a bit more about Lucene; maybe then I will get your point about Similarity.

Thanks a lot to you both for the effort. I hope it will come out as something useful for more people than me alone.
Eks

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> In fact, if you make this clean and pluggable enough, it seems
> reasonable to make this type of change to the core (with the default
> being the current Levenshtein distance formula, of course).
>
> Perhaps the formula should simply bounce through Similarity somehow so
> that the computation can be centralized there (passing the field name,
> to key off that if you like)?
>
> Erik
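
P.S. To make the per-field idea a bit more concrete, here is a minimal sketch of what I have in mind. It is plain Java with no Lucene dependency, and every name in it (FieldFuzzyConfig, TermDistance, Settings, score) is hypothetical; nothing like this exists in Lucene today. The default threshold of 0.5 and the normalization by the longer term's length are assumptions made only for illustration, not a description of what FuzzyTermEnum does.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-field fuzzy settings; not part of Lucene. */
final class FieldFuzzyConfig {

    /** Pluggable distance: returns a similarity in [0, 1], where 1 means identical. */
    interface TermDistance {
        float similarity(String target, String candidate);
    }

    /** Default: similarity derived from plain Levenshtein edit distance (assumed normalization). */
    static final TermDistance LEVENSHTEIN = (a, b) -> {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0f : 1.0f - (float) editDistance(a, b) / maxLen;
    };

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Everything that is fixed today but could vary per field. */
    static final class Settings {
        final TermDistance distance;
        final float threshold;   // minimum similarity for a term to be kept
        final int prefixLength;  // leading characters that must match exactly

        Settings(TermDistance distance, float threshold, int prefixLength) {
            this.distance = distance;
            this.threshold = threshold;
            this.prefixLength = prefixLength;
        }
    }

    // Assumed defaults, so that fields without explicit settings keep one shared behaviour.
    private final Settings defaults = new Settings(LEVENSHTEIN, 0.5f, 0);
    private final Map<String, Settings> perField = new HashMap<>();

    void set(String field, TermDistance distance, float threshold, int prefixLength) {
        perField.put(field, new Settings(distance, threshold, prefixLength));
    }

    Settings forField(String field) {
        return perField.getOrDefault(field, defaults);
    }

    /** Returns the similarity if the candidate passes the field's filter, or -1 if it is rejected. */
    float score(String field, String target, String candidate) {
        Settings s = forField(field);
        int p = Math.min(s.prefixLength, target.length());
        if (!candidate.startsWith(target.substring(0, p))) {
            return -1f; // required prefix differs, so the distance is never computed
        }
        float sim = s.distance.similarity(target, candidate);
        return sim > s.threshold ? sim : -1f;
    }
}
```

A field with no explicit entry falls back to the Levenshtein default, so existing behaviour would stay as it is. A call such as cfg.set("name", myDistance, 0.7f, 2) would swap in a different function, threshold and required prefix for that one field. The value returned by score() is what would then be passed on via setBoost().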