Re: Fuzzy query with Jaro-Winkler distance

eks dev Thu, 22 Apr 2004 05:37:50 -0700

Sorry Erik, I did not really get your point about
Similarity: what should be done there, and why?

As I see it, the current implementation of the fuzzy
query works something like this:

FuzzyTermEnum has two responsibilities:
1. To filter out terms that are not similar enough to
the search term (FUZZY_THRESHOLD)
2. To actually calculate how big the difference is
(difference())

This calculated difference gets introduced into the
big picture via setBoost() in the
MultiTermQuery.rewrite(). 
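
To make the two steps concrete, here is a minimal,
self-contained sketch of the filter-then-boost flow.
It is not the Lucene source: the class
FuzzyFlowSketch, the method boostFor(), the 0.5
threshold and the scaling rule are all assumptions
chosen for illustration.

```java
/** Hypothetical stand-in for the filter-then-score flow; not Lucene's classes. */
class FuzzyFlowSketch {
    // Assumed cut-off: terms at or below this similarity are dropped.
    static final float THRESHOLD = 0.5f;
    // Assumed rescaling so that accepted similarities map onto (0, 1].
    static final float SCALE = 1.0f / (1.0f - THRESHOLD);

    /** Levenshtein edit distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus distance over the longer length. */
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0f : 1.0f - (float) editDistance(a, b) / max;
    }

    /** Step 1 filters the term, step 2 turns the margin into a boost; -1 means filtered out. */
    static float boostFor(String queryText, String indexedTerm) {
        float sim = similarity(queryText, indexedTerm);
        if (sim <= THRESHOLD) {
            return -1.0f;
        }
        return (sim - THRESHOLD) * SCALE;
    }
}
```

In this sketch an exact match gets boost 1.0, and the
boost falls towards 0 as the term approaches the
threshold.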

The way this difference is used in the similarity
calculation could probably be moved into Similarity,
using some method other than setBoost(), but the
actual calculation of the distance is one level lower
than this. At the moment I have no better idea than to
leave this as it is (via setBoost()).

What Robert suggested sounds pretty reasonable to me.
With it we would be able to:
- Have a different distance function for each field
- Give users a clean way to implement a new distance
function and then simply map it to the field name
- In the same go, I would suggest that FUZZY_THRESHOLD/
SCALE_FACTOR become changeable at the field level;
right now they are fixed.
- Probably introduce a parameter that controls the
“required prefix” length, to trade speed against
quality. (I must stare at the current code a bit
longer to understand how this could be done.)
- Break nothing in the current implementation, thanks
to clever defaults
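
The list above could be sketched roughly as follows.
None of these types exist in Lucene; TermDistance,
FuzzyFieldConfig and FuzzyConfigRegistry are
hypothetical names, and the prefix check is only one
possible reading of the “required prefix” idea.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pluggable measure: returns a similarity in [0, 1], 1 meaning identical. */
interface TermDistance {
    float similarity(String queryText, String indexedTerm);
}

/** Default measure: Levenshtein distance normalised by the longer length. */
class LevenshteinSimilarity implements TermDistance {
    public float similarity(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0f : 1.0f - (float) prev[b.length()] / max;
    }
}

/** Hypothetical per-field settings: measure, threshold and required prefix length. */
class FuzzyFieldConfig {
    final TermDistance distance;
    final float threshold;
    final int prefixLength;

    FuzzyFieldConfig(TermDistance distance, float threshold, int prefixLength) {
        this.distance = distance;
        this.threshold = threshold;
        this.prefixLength = prefixLength;
    }

    /** Cheap prefix comparison first, the costlier distance only if it passes. */
    boolean accepts(String queryText, String indexedTerm) {
        int p = Math.min(prefixLength, queryText.length());
        if (!indexedTerm.regionMatches(0, queryText, 0, p)) {
            return false;
        }
        return distance.similarity(queryText, indexedTerm) > threshold;
    }
}

/** Maps field names to settings; unknown fields fall back to the default. */
class FuzzyConfigRegistry {
    private final Map<String, FuzzyFieldConfig> byField = new HashMap<>();
    private final FuzzyFieldConfig fallback;

    FuzzyConfigRegistry(FuzzyFieldConfig fallback) {
        this.fallback = fallback;
    }

    void register(String field, FuzzyFieldConfig config) {
        byField.put(field, config);
    }

    FuzzyFieldConfig forField(String field) {
        return byField.getOrDefault(field, fallback);
    }
}
```

A field with no registered entry keeps the default
(here Levenshtein with whatever threshold the caller
passes as fallback), which is how existing behaviour
would stay unchanged.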

In the longer term I plan to experiment with some
other approaches to speed up the fuzzy query, e.g.
creating a bigram index of all tokens in the index and
then searching this bigram index for candidate tokens
that should be compared (an “Inverted Index inside the
Index”). But it is too early for this.
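
As a rough illustration of the bigram idea, here is a
small in-memory sketch. It is only an assumption about
how such a candidate lookup might work; the class
BigramCandidateIndex and the minShared cut-off are
hypothetical, and nothing here is tied to Lucene's
index format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory bigram index used to pre-select candidate terms. */
class BigramCandidateIndex {
    // bigram -> all indexed terms that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Distinct two-character substrings of a term; shorter terms yield none. */
    static Set<String> bigrams(String term) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            grams.add(term.substring(i, i + 2));
        }
        return grams;
    }

    void add(String term) {
        for (String gram : bigrams(term)) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(term);
        }
    }

    /** Terms sharing at least minShared distinct bigrams with the query text. */
    Set<String> candidates(String queryText, int minShared) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : bigrams(queryText)) {
            for (String term : postings.getOrDefault(gram, Collections.<String>emptySet())) {
                shared.merge(term, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            if (e.getValue() >= minShared) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Only the returned candidates would then go through the
full distance calculation. Note that in this sketch
terms shorter than two characters are never returned,
so they would need separate handling.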

Anyhow, I will do some experiments and try to learn a
bit more about Lucene; maybe then I will get your
point about Similarity.

Thanks a lot to you both for the effort, I hope it
will come out as something useful for more people than
me alone.

Eks 

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> In fact, if you make this clean and pluggable
> enough, it seems reasonable to make this type of
> change to the core (with the default being the
> current Levenshtein distance formula, of course).
>
> Perhaps the formula should simply bounce through
> Similarity somehow so that the computation can be
> centralized there (passing the field name, to key
> off that if you like)?
>
>       Erik


