Hello Claude,

Hmm, it is interesting that you see slop=2 matching the query "quick fox" against the document "the fox is quick".
Edit distance (Levenshtein) is a bit tricky because it might count a transposition (just swapping the two words) as edit distance 1 OR 2. So maybe Lucene's PhraseQuery is counting a transposition as edit distance 1, in which case your test makes sense and the javadocs are wrong? I am far from an expert on PhraseQuery :)

Does anyone know if we changed the behavior? In any case, we must at least fix the javadocs.

Claude, maybe open a Jira issue (https://issues.apache.org/jira/projects/LUCENE/summary) and we can discuss there?

Thank you for catching this!

Mike McCandless

http://blog.mikemccandless.com

On Fri, Dec 10, 2021 at 8:47 AM Claude Lepere <claudelep...@gmail.com> wrote:

> Hello.
>
> The explanation of
> https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop--
> writes that the edit distance between "quick fox" and "the fox is quick"
> would be at an edit distance of 3; this seems inaccurate to me.
>
> I don't know if the edit distance used by Lucene is the Levenshtein
> distance (insertion, deletion, substitution, all of weight 1) - a standard
> in information retrieval - but a test of "quick fox" PhraseQuery with a
> slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
> slop does not have to be 3.
>
> I wonder if I'm right.
>
> Claude Lepère, Belgium
>
> claudelep...@gmail.com
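
For the discussion, here is a minimal sketch of one way to arrive at the figure of 3 that the javadoc gives, as quoted above. It assumes a simple position-difference model for a two-term phrase: the two terms are expected to sit 1 position apart, and the slop needed is how far the document's actual difference deviates from that. The helper name `javadoc_slop` is hypothetical, and this is not Lucene's actual matching code.

```python
def javadoc_slop(query, document):
    """Slop needed for a two-term phrase under a position-difference model.

    Hypothetical helper for illustration only: it is an assumed reading of
    the getSlop javadoc, not Lucene's sloppy phrase matcher.
    """
    first, second = query.split()
    doc_terms = document.split()
    # In the query, the second term sits exactly 1 position after the first.
    expected = 1
    # Actual difference in the document (negative when the order is reversed).
    actual = doc_terms.index(second) - doc_terms.index(first)
    return abs(actual - expected)


print(javadoc_slop("quick fox", "the fox is quick"))  # 3
print(javadoc_slop("quick fox", "fox quick"))         # 2
print(javadoc_slop("quick fox", "quick fox"))         # 0
```

Under this model "the fox is quick" needs slop 3, which is the javadoc's figure, so the model does not explain the slop=2 hit that Claude reports.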