Hello Claude,

Hmm, it is interesting that you see slop=2 matching the query "quick fox"
against the document "the fox is quick".

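Edit distance is a bit tricky because a transposition (just swapping the
two words) can count as edit distance 1 OR 2, depending on the definition:
plain Levenshtein needs two edits for a swap, while variants such as
Damerau-Levenshtein count it as one.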

So maybe Lucene's PhraseQuery is counting transposition as edit distance 1,
in which case, your test makes sense, and the javadocs are wrong?
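
To make the 1-vs-2 point concrete, here is a minimal sketch of a word-level
edit distance in plain Java. It is only an illustration and is NOT how
Lucene's PhraseQuery computes slop; the class name WordEditDistance and the
allowTransposition flag are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; this is NOT Lucene's PhraseQuery code.
final class WordEditDistance {

    // Word-level edit distance: insertions, deletions and substitutions cost 1.
    // When allowTransposition is true, swapping two adjacent words also costs 1
    // (the "optimal string alignment" variant of Damerau-Levenshtein).
    static int distance(List<String> a, List<String> b, boolean allowTransposition) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.size(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                int best = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                // Adjacent swap: the last two words of each prefix are exchanged.
                if (allowTransposition && i > 1 && j > 1
                        && a.get(i - 1).equals(b.get(j - 2))
                        && a.get(i - 2).equals(b.get(j - 1))) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.size()][b.size()];
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("quick", "fox");
        List<String> swapped = Arrays.asList("fox", "quick");
        System.out.println(distance(query, swapped, false)); // prints 2
        System.out.println(distance(query, swapped, true));  // prints 1
    }
}
```

So the same swap of two adjacent words costs 2 under plain Levenshtein and 1
once transpositions are allowed, which is the ambiguity I mean.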

I am far from an expert on PhraseQuery :)  Does anyone know whether we
changed the behavior?  In any case, we must at least fix the javadocs.  Claude,
maybe open a Jira issue (
https://issues.apache.org/jira/projects/LUCENE/summary) and we can
discuss there?

Thank you for catching this!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Dec 10, 2021 at 8:47 AM Claude Lepere <claudelep...@gmail.com>
wrote:

> Hello.
>
>
> The explanation of getSlop in the PhraseQuery javadocs
> (https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop--)
> states that "the fox is quick" would be at an edit distance of 3 from
> "quick fox"; this seems inaccurate to me.
>
> I don't know if the edit distance used by Lucene is the Levenshtein
> distance (insertion, deletion, substitution, all of weight 1) - a standard
> in information retrieval - but a test of "quick fox" PhraseQuery with a
> slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
> slop does not have to be 3.
>
> I wonder if I'm right.
>
>
> Claude Lepère, Belgium
>
> claudelep...@gmail.com
