FWIW,  I integrated sourceforge's "SecondString" algos 
(http://secondstring.sourceforge.net/javadoc ) and others using a callout 
interface which boiled down to:

  float getDifference(String a, String b)

This seemed to be the cleanest lowest-common-denominator standard for plugging 
in string comparison algos where each implementation produces a score between 0 
and 1. 
I didn't find anything in the algorithm alternatives which was particularly 
faster or better than the existing edit distance impl but I don't think it 
hurts to enable pluggability. 

On a related topic - if effort is being spent refactoring FuzzyQuery I would 
re-raise the more pressing concern that the default IDF ranking in here is 
broken because it favours rare terms (ie misspellings).

See contrib/queries/....FuzzyLikeThisQuery for alternative scoring strategies 
or the thread here: 
http://www.mail-archive.com/java-user@lucene.apache.org/msg05811.html


Cheers
Mark



----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 8 June, 2006 3:12:04 PM
Subject: Re: Edit-distance strategy

what could show measurably faster results is more in Jaro distance direction, 
but even than, the fact that you need to scan all terms in dictionary will make 
it prohibitive for large colletions. For small collections this can be packed 
in some smart data structures to restrict number of distance calculations...
good luck with it

----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 8 June, 2006 4:07:07 PM
Subject: Re: Edit-distance strategy


>I'm about to replace the edit-distance algorithm in FuzzyQuery from
>Levenstein to Hirschberg to save a couple of clockticks.

Have you allready confirmed Hirschberg algorithm to be faster than current 
implementation of edit distance? I am not convinced it helps really. Hirschberg 
and standard DP algorithm both have O(|s1|*|s2|) time. The only saving is that 
Hirschberg needs O(|s2|) space using binary recursion (classic needs 
O(|s1|*|s2|) space). 




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to