I am no expert on this, but I got curious and looked at
FuzzyQuery/MultiTermQuery, and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although that does seem like something people
would generally expect. So maybe FuzzyQuery should somehow do that?
But without changing it, you could also use a query that does it
explicitly: if you get a term "foo", you could maybe search for
"foo OR foo~"?
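
To sanity-check that idea, here is a rough back-of-the-envelope sketch
in plain Java (no Lucene dependency) that redoes the ClassicSimilarity
arithmetic from the explain output quoted below and then adds a second,
exact-term clause for "spark". The class and method names are made up
for illustration, and it assumes the extra clause would score the same
as the "spark" term already does in the fuzzy expansion, which I have
not verified against a real index. Programmatically, I would expect the
equivalent to be a BooleanQuery with a TermQuery and a FuzzyQuery as
two SHOULD clauses.

```java
public class FuzzyScoreSketch {
    // idf as printed in the explain output: log((docCount+1)/(docFreq+1)) + 1
    static double idf(int docFreq, int docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // One term's contribution: boost * idf * tf * fieldNorm, with tf = sqrt(freq)
    static double termScore(double boost, int freq, int docFreq, int docCount, double fieldNorm) {
        return boost * idf(docFreq, docCount) * Math.sqrt(freq) * fieldNorm;
    }

    // fieldNorm shown as 0.70710677 in the explain output (two-term field)
    static final double NORM = 1.0 / Math.sqrt(2.0);

    // "spark spark" under spark~ : one term, freq 2, boost 1.0
    static double exactDocScore() {
        return termScore(1.0, 2, 1, 2, NORM);
    }

    // "spar spars" under spark~ : two fuzzy terms with boosts 0.75 and 0.8
    static double fuzzyDocScore() {
        return termScore(0.75, 1, 1, 2, NORM) + termScore(0.8, 1, 1, 2, NORM);
    }

    // "spark spark" under "spark OR spark~" : the fuzzy score plus an
    // exact-term clause (assumed to score like the unboosted fuzzy term)
    static double exactDocScoreWithTermClause() {
        return exactDocScore() + termScore(1.0, 2, 1, 2, NORM);
    }

    public static void main(String[] args) {
        System.out.println("spark~           exact doc: " + exactDocScore());
        System.out.println("spark~           fuzzy doc: " + fuzzyDocScore());
        System.out.println("spark OR spark~  exact doc: " + exactDocScoreWithTermClause());
        // The fuzzy doc has no "spark", so the extra clause adds nothing to it.
        System.out.println("spark OR spark~  fuzzy doc: " + fuzzyDocScore());
    }
}
```

Under those assumptions the first two values should line up with the
1.4054651 and 1.5404116 in the explain output, and the extra clause
roughly doubles the exact document's score, which would put it ahead of
the "spar spars" document. If that margin is too thin on a larger
corpus, boosting the exact-term clause would be the obvious knob to try.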

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob <md...@mdrob.com> wrote:
>
> Hi folks,
>
> I'm working with some fuzzy queries and trying my best to understand what
> is the expected behaviour of the searcher. I'm not sure if this is a
> similarity bug or an incorrect usage on my end.
>
> The problem is when I do a fuzzy search for a term "spark~" then instead of
> matching documents with spark first, it will match other documents that
> have multiple other near terms like "spar" and "spars". I see this same
> thing with both ClassicSimilarity and BM25.
>
> This is from a much smaller (two document) index when I was trying to
> isolate and reproduce the issue, but I see comparable behaviour with more
> varied scoring on a much larger corpus. The two documents are:
>
> addDoc("spark spark", writer); // exact match
>
> addDoc("spar spars", writer); // multiple fuzzy terms
>
> The non-zero edit distance terms get a slight down-boost, but it's not
> enough to overcome their sum exceeding even the TF boost for the desired
> document.
>
> A full reproducible unit test is at
> https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
>
> What is the recommended approach to get the document with exact term
> matching for me again? I don't see an option to tweak the internal boost
> provided by FuzzyQuery, that's one idea I had. Or is this a different
> change that needs to be fixed at the lucene level rather than application
> level?
>
> Thanks,
> Mike
>
>
>
> More detail:
>
>
> The first document with the field "spark spark" has a score explanation:
>
> 1.4054651 = sum of:
>   1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
>     1.4054651 = score(freq=2.0), product of:
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.4142135 = tf(freq=2.0), with freq of:
>         2.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm
>
> And a document with the field "spar spars" comes in ever so slightly higher
> at
>
> 1.5404116 = sum of:
>   0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
>     0.74536043 = score(freq=1.0), product of:
>       0.75 = boost
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.0 = tf(freq=1.0), with freq of:
>         1.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm
>   0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
>     0.79505116 = score(freq=1.0), product of:
>       0.8 = boost
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.0 = tf(freq=1.0), with freq of:
>         1.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm

